ABSTRACT

Solid state drives (SSDs) have seen wide deployment in mobiles, 
desktops, and data centers due to their high I/O performance and 
low energy consumption. As SSDs write data out-of-place, garbage 
collection (GC) is required to erase and reclaim space with invalid 
data. However, GC poses additional writes that hinder the I/O per- 
formance, while SSD blocks can only endure a finite number of 
erasures. Thus, there is a performance-durability tradeoff on the 
design space of GC. To characterize the optimal tradeoff, this pa- 
per formulates an analytical model that explores the full optimal 
design space of any GC algorithm. We first present a stochastic 
Markov chain model that captures the I/O dynamics of large-scale 
SSDs, and adapt the mean-field approach to derive the asymptotic 
steady-state performance. We further prove the model convergence 
and generalize the model for all types of workload. Inspired by 
this model, we propose a randomized greedy algorithm (RGA) that 
can operate along the optimal tradeoff curve with a tunable pa- 
rameter. Using trace-driven simulation on DiskSim with SSD add- 
ons, we demonstrate how RGA can be parameterized to realize the 
performance-durability tradeoff. 
1. INTRODUCTION

The increasing adoption of solid-state drives (SSDs) is revolu- 
tionizing storage architectures. Today's SSDs are mainly built on 
NAND flash memory, and provide several attractive features: high 
performance in I/O throughput, low energy consumption, and high 
reliability due to their shock resistance. As the SSD price per
gigabyte decreases [21], not only are desktops replacing traditional
hard-disk drives (HDDs) with SSDs, but there is also a growing trend
of using SSDs in data centers [19, 27].

SSDs have inherently different I/O characteristics from tradi- 
tional HDDs. An SSD is organized in blocks, each of which usually 
contains 64 or 128 pages that are typically 4KB each. It supports
three basic operations: read, write, and erase. The read and write
operations are performed at page granularity, while the erase
operation is performed at block granularity. After a block is erased,
all pages of the block become clean. Each write can only operate
on a clean page; when a clean page is written, it becomes a valid 
page. To improve the write performance, SSDs use the out-of-place 
write approach. That is, to update data in a valid page, the new data 
is first written to a different clean page, and the original page con- 
taining old data is marked invalid. Thus, a block may contain a mix 
of clean pages, valid pages, and invalid pages. 



The unique I/O characteristics of SSDs pose different design re- 
quirements from those in HDDs. Since each write must operate 
on a clean page, garbage collection (GC) must be employed to re- 
claim invalid pages. GC can be triggered, for example, when the 
number of clean pages drops below a predefined threshold. Dur- 
ing GC, some blocks are chosen to be erased, and all valid pages 
in an erased block must first be written to a different free block 
prior to the erasure. Such additional writes introduce performance 
overhead to normal read/write operations. To maintain high perfor- 
mance, one design requirement of SSDs is to minimize the clean- 
ing cost, such that a GC algorithm chooses blocks containing as 
few valid pages as possible for reclamation. 

However, SSDs only allow each block to tolerate a limited number of
erasures before becoming unusable. For instance, the number is
typically 100K for single-level cell (SLC) SSDs and 10K for
multi-level cell (MLC) SSDs [13]. With more bits being stored in a
flash cell and smaller feature sizes of flash cells, the maximum
number of erasures tolerable by each block further decreases, for
example, to several thousand or even several hundred for the latest
3-bit MLC SSDs [23]. Thus, to maintain high durability, another design
requirement of SSDs is to maximize wear-leveling in GC, such that all
blocks have similar numbers of erasures over time, so as to avoid any
"hot" blocks wearing out early.

Clearly, there is a performance-durability tradeoff in the GC de- 
sign space. Specifically, a GC algorithm with a low cleaning cost 
may not achieve efficient wear-leveling, or vice versa. Prior work 
(e.g., [1]) addressed the tradeoff, but the study is mainly based on
simulations. From the viewpoints of SSD practitioners, it remains 
an open design issue of how to choose the "best" parameters of a 
GC algorithm to adapt to different tradeoff requirements for differ- 
ent application needs. However, understanding the performance- 
durability tradeoff is non-trivial, since it depends on the I/O dynam- 
ics of an SSD and the dynamics characterization becomes compli- 
cated with the increasing numbers of blocks/pages of the SSD. This 
motivates us to formulate a framework that can efficiently capture 
the optimal design space of GC algorithms and guide the choices 
of parameterizing a GC algorithm to fit any tradeoff requirement. 

In this paper, we develop an analytical model that character- 
izes the I/O dynamics of an SSD and the optimal performance- 
durability tradeoff of a GC algorithm. Using our model as a base- 
line, we propose a tunable GC algorithm for different performance- 
durability tradeoff requirements. To summarize, our paper makes 
the following contributions: 

• We formulate a stochastic Markov chain model that captures
the I/O dynamics of an SSD. Since the state space of our
stochastic model increases with the SSD size, we adapt the
mean field technique [5, 37] to make the model tractable.
We formally prove the convergence results under the uniform
workload to enable us to analyze the steady-state performance
of a GC algorithm. We also discuss how our system
model can be extended for a general workload.

• We identify the optimal extremal points that correspond to 
the minimum cleaning cost and the maximum wear-leveling, 
as well as the optimal tradeoff curve of cleaning cost and 
wear-leveling that enables us to explore the full optimal de- 
sign space of the GC algorithms. 

• Based on our analytical model, we propose a novel GC algorithm
called the randomized greedy algorithm (RGA) that can be tuned to
operate along the optimal tradeoff curve. RGA also incurs low RAM
usage and low computational cost.

• To address the practicality of our work, we conduct extensive 
simulations using the DiskSim simulator [8] with SSD exten-
sions [1]. We first validate via synthetic workloads that our
model efficiently characterizes the asymptotic steady-state 
performance. Furthermore, we consider real-world workload 
traces and use trace-driven simulations to study the perfor- 
mance tradeoff and versatility of RGA. 

The rest of the paper proceeds as follows. In §2, we propose a
Markov model to capture the system dynamics of an SSD and conduct
the mean field analysis. We formally prove the convergence, and
further extend the model for a general workload. In §3, we study the
design tradeoff between cleaning cost and wear-leveling of GC
algorithms. In §4, we propose RGA and analyze its performance. In §5,
we validate our model via simulations. In §6, we present the
trace-driven simulation results. In §7, we review related work, and
finally in §8, we conclude the paper.

2. SYSTEM MODEL 

We formulate a Markov chain model to characterize the I/O dy- 
namics of an SSD under the read, write, and GC operations. We 
then analyze the model via the mean field technique when the SSD 
scales with the increasing number of blocks or storage capacity. 

2.1 Markov Chain Model Formulation 

Our model considers an SSD with $N$ blocks of $k$ pages each,
where the typical value of $k$ is 64 or 128 for today's commonly
used SSDs. Since SSDs use the out-of-place write approach (see
§1), a write to a logical page may land on any physical page.
Therefore, SSDs implement address mapping to map a logical page
to a physical page. Address mapping is maintained in the software
flash translation layer (FTL) in the SSD controller. It can be
implemented at block level [42], at page level [24], or in hybrid
form [16, 33, 39]. A survey of the FTL design including the address
mapping mechanisms can be found in [17]. In this paper, our model
abstracts out the complexity due to address mapping; specifically,
we focus on the physical address space and directly characterize
the I/O dynamics of physical blocks.

Recall from §1 that a page can be in one of three states: clean,
valid, or invalid. We classify each block into a different type based
on the number of valid pages contained in the block. Specifically,
a block of type $i$ contains exactly $i$ valid pages. Since each block
has $k$ pages, a block can be of one of $k + 1$ types (i.e., from 0
to $k$ valid pages). If a block is of type $i$, then we say it is in
state $i$. Let $X_n(t)$ denote the state of block
$n \in \{1, \ldots, N\}$ at time $t$. Then the state descriptor for
the whole SSD is

$$X^N(t) = (X_1(t), X_2(t), \ldots, X_N(t)), \qquad (1)$$

where $X_n(t) \in \{0, 1, \ldots, k\}$. Thus, the state space
cardinality is $(k+1)^N$. To facilitate our analysis under the large
system regime (as we will show later), we transform the above state
descriptor to:

$$n^N(t) = (n_0(t), n_1(t), \ldots, n_k(t)), \qquad (2)$$

where $n_i(t) \in \{0, 1, \ldots, N\}$ denotes the number of type $i$
blocks in the SSD. Clearly, we have $\sum_{i=0}^{k} n_i(t) = N$, and
the state space cardinality is $\binom{N+k}{k}$.
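
As a small illustration of the two state descriptors, the following Python sketch collapses a per-block state vector into the type counts of Equation (2) and compares the two state-space sizes. The function names and the toy values ($N = 6$, $k = 3$) are assumptions made for illustration only; they do not come from the paper.

```python
from collections import Counter
from math import comb


def type_counts(block_states, k):
    """Collapse per-block states X_n into counts n_i of type-i blocks."""
    counts = Counter(block_states)
    return [counts.get(i, 0) for i in range(k + 1)]


def per_block_state_space(num_blocks, k):
    """Number of distinct per-block descriptors: (k + 1)^N."""
    return (k + 1) ** num_blocks


def count_state_space(num_blocks, k):
    """Number of distinct count descriptors: C(N + k, k)."""
    return comb(num_blocks + k, k)


# Toy example (hypothetical values): N = 6 blocks, k = 3 pages per block.
states = [0, 2, 2, 1, 3, 2]
counts = type_counts(states, k=3)
```

Even in this toy case the count descriptor is far smaller than the per-block descriptor, which is the reason for the transformation.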

We first describe how different I/O requests affect the system
dynamics of an SSD from the perspective of physical blocks. The
I/O requests can be classified into four types: (1) read a page, (2)
perform GC on a block, (3) program (i.e., write) new data to a page,
and (4) invalidate a page. First, read requests do not change
$n^N(t)$. For GC, the SSD selects a block, writes all valid pages of
that block to a clean block, and finally erases the selected block.
Thus, GC requests do not change $n^N(t)$ either. On the other
hand, for the program and invalidate requests, if the corresponding
block is of type $i$, it will move from state $i$ to state $i+1$ and
to state $i-1$, respectively.

We now describe the state transition of a block in an SSD. Since
the read and GC requests do not change $n^N(t)$, we only need to
model the program and invalidate requests. Suppose that the program
and invalidate requests arrive as a Poisson process with rate
$\lambda$. Also, suppose that the workload is uniform, such that all
pages in the SSD have an equal probability of being accessed (in
§2.5, we extend our model for a general workload). The assumption
of the uniform workload implies that (1) each block has the
same probability $1/N$ of being accessed, (2) the probability of
invalidating one page is proportional to the number of valid pages of
the corresponding block, and (3) the probability of programming a
page is proportional to the total number of invalid and clean pages
of the corresponding block. Thus, if the requested block is of type
$i$, then the probability of invalidating one page of the block is
$i/k$, and that of programming one page in the block is $(k-i)/k$.
Figure 1 illustrates the state transitions of a single block in an
SSD under the program and invalidate requests. If a block is in state
$i$, the program and invalidate requests will move it to state $i+1$
at rate $\lambda \frac{k-i}{Nk}$ and to state $i-1$ at rate
$\lambda \frac{i}{Nk}$, respectively. Note that Figure 1 only
shows the state transition of a particular block, but not the whole
SSD. Specifically, the state space cardinality of a particular block
is $k + 1$ as shown in Figure 1, while that of the whole SSD is
$\binom{N+k}{k}$ as described by Equation (2).

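[Figure 1: birth-death chain on states $0, 1, \ldots, k$; a block in
state $i$ moves to state $i+1$ at rate $\lambda (k-i)/(Nk)$ and to
state $i-1$ at rate $\lambda i/(Nk)$.]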

Figure 1: State transition of a block in an SSD. 
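
The per-block transition rates under the uniform workload can be written down directly. The sketch below is only an illustration; the function name `block_rates` and the sample parameter values are assumptions, not part of the paper. It also shows that the total rate at which any single block changes state is $\lambda/N$, independent of $i$.

```python
def block_rates(i, k, num_blocks, lam):
    """Return (program_rate, invalidate_rate) for a block in state i.

    program_rate moves the block to state i + 1,
    invalidate_rate moves it to state i - 1 (uniform workload).
    """
    if not 0 <= i <= k:
        raise ValueError("state i must lie in 0..k")
    program_rate = lam * (k - i) / (num_blocks * k)
    invalidate_rate = lam * i / (num_blocks * k)
    return program_rate, invalidate_rate


# Hypothetical parameters: k = 8 pages, N = 100 blocks, lambda = 2.0.
up, down = block_rates(3, k=8, num_blocks=100, lam=2.0)
```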

To characterize the I/O dynamics of an SSD, we define the occupancy
measure $M^N(t)$ as the vector of fractions of type $i$ blocks at
time $t$. Formally, we have

$$M^N(t) = (M_0(t), M_1(t), \ldots, M_k(t)),$$

where $M_i(t)$ is

$$M_i(t) = \frac{n_i(t)}{N}. \qquad (3)$$

In other words, $M_i(t)$ is the fraction of type $i$ blocks in the SSD.
It is easy to see that the occupancy measure $M^N(t)$ is a
homogeneous Markov chain.

We are interested in modeling large-scale SSDs to understand
the performance implication of any GC algorithm. By large-scale,
we mean that the number of blocks $N$ of an SSD is large. For
example, for a 256GB SSD (which is available from many of today's
SSD manufacturers), we have $N \approx 10^6$ and $k = 64$ for a page
size of 4KB, implying a huge state space of $M^N(t)$. Since $M^N(t)$
does not possess any special structure (e.g., matrix-geometric form),
analyzing it can be computationally expensive.
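
The block count quoted above follows from simple arithmetic. The snippet below reproduces it, assuming binary units (1GB = $2^{30}$ bytes, 1KB = $2^{10}$ bytes); that unit convention is an assumption, and decimal units would give a figure of the same order.

```python
GIB = 2 ** 30  # assumed: binary gigabyte
KIB = 2 ** 10  # assumed: binary kilobyte

capacity_bytes = 256 * GIB
page_bytes = 4 * KIB
pages_per_block = 64  # k

# Number of blocks N = capacity / (k * page size).
num_blocks = capacity_bytes // (pages_per_block * page_bytes)
```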

2.2 Mean Field Analysis 

To make our Markov chain model tractable for a large-scale SSD,
we employ the mean field technique [5, 37]. The main idea is that
the stochastic process $M^N(t)$ can be approximated by a deterministic
process $s(t) = (s_0(t), s_1(t), \ldots, s_k(t))$ as $N \to \infty$,
where $s_i(t)$ denotes the fraction of blocks of type $i$ at time $t$
in the deterministic process. We call $s(t)$ the mean field limit. By
solving the deterministic process $s(t)$, we can obtain the occupancy
measure of the stochastic process $M^N(t)$.

We introduce the concept of intensity, denoted by $\varepsilon(N)$.
Intuitively, the probability that a block performs a state transition
per time slot is on the order of $\varepsilon(N)$. Under the uniform
workload, each block is accessed with the same probability $1/N$, so
we have $\varepsilon(N) = 1/N$. Now, we re-scale the process $M^N(t)$
to $\overline{M}^N(t)$:

$$\overline{M}^N(t\,\varepsilon(N)) = M^N(t) \quad \forall t \ge 0. \qquad (4)$$

For simplicity, we drop the notation $t$ when the context is clear.
We now show how the deterministic process $s(t)$ is related to the
re-scaled process $\overline{M}^N(t)$. The time evolution of the
deterministic process can be specified by the following set of
ordinary differential equations (ODEs):



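$$\frac{ds_i}{dt} = -\lambda s_i + \lambda \frac{k-i+1}{k} s_{i-1} + \lambda \frac{i+1}{k} s_{i+1}, \quad 1 \le i \le k-1,$$

$$\frac{ds_0}{dt} = -\lambda s_0 + \lambda \frac{1}{k} s_1, \qquad (5)$$

$$\frac{ds_k}{dt} = -\lambda s_k + \lambda \frac{1}{k} s_{k-1}.$$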



The idea of the above ODEs is explained as follows. For an
SSD with $N$ blocks, we express the expected change in the number
of blocks of type $i$ over a small time period of length $dt$ under
the re-scaled process $\overline{M}^N(t)$. This corresponds to the
expected change over a time period of length $N\,dt$ under the
original process $M^N(t)$. During this period (of length $N\,dt$),
there are $\lambda N\,dt$ program/invalidate requests on average, each
of which changes the state of a given type $i$ block to state $i-1$ or
state $i+1$ with probability $1/N$. Since there are a total of $N s_i$
blocks of type $i$, the expected change from state $i$ to other
states is $\lambda N\,dt\, s_i$. Using similar arguments, the expected
change in the number of blocks from state $i+1$ to state $i$ is
$\lambda N\,dt\, \frac{i+1}{k} s_{i+1}$, and that from state $i-1$ to
state $i$ is $\lambda N\,dt\, \frac{k-i+1}{k} s_{i-1}$. Dividing by
$N$ turns these counts into fractions of blocks. Similarly, we can
also specify the expected change in the fraction of blocks of type 0
and type $k$, and we obtain the ODEs as stated in Equation (5).
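
To make the ODE system concrete, the following sketch integrates it with a plain forward-Euler scheme, starting from an SSD in which every block has zero valid pages. The step size, horizon, $k = 8$, and all function names are assumptions chosen for illustration; the paper does not prescribe a numerical method. The trajectory is compared against the binomial distribution $\binom{k}{i}/2^k$, which §2.3 identifies as the fixed point.

```python
from math import comb


def rhs(s, lam):
    """Right-hand side of the mean field ODEs for the uniform workload."""
    k = len(s) - 1
    ds = []
    for i in range(k + 1):
        inflow = 0.0
        if i >= 1:
            inflow += (k - i + 1) / k * s[i - 1]  # programs into state i
        if i <= k - 1:
            inflow += (i + 1) / k * s[i + 1]      # invalidations into state i
        ds.append(lam * (inflow - s[i]))
    return ds


def integrate(s, lam=1.0, dt=0.01, steps=10_000):
    """Forward-Euler integration (an assumed, deliberately simple scheme)."""
    for _ in range(steps):
        ds = rhs(s, lam)
        s = [x + dt * dx for x, dx in zip(s, ds)]
    return s


k = 8
s_start = [1.0] + [0.0] * k          # all blocks start in state 0
s_end = integrate(s_start)
binomial = [comb(k, i) / 2 ** k for i in range(k + 1)]
max_gap = max(abs(a - b) for a, b in zip(s_end, binomial))
```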

2.3 Derivation of the Fixed Point 

We now derive the fixed point of the deterministic process in
Equation (5). Specifically, $\pi$ is said to be a fixed point if
$s(t) = \pi$ implies $s(t') = \pi$ for all $t' \ge t$. In other words,
the fixed point $\pi$ describes the distribution of different types of
blocks in the steady state. The necessary and sufficient condition for
$\pi$ to be a fixed point is that $\frac{ds_i}{dt} = 0$ for all
$i \in \{0, 1, \ldots, k\}$.



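Theorem 1. Equation (5) has a unique fixed point $\pi$ given by:

$$\pi_i = \frac{\binom{k}{i}}{2^k}, \quad 0 \le i \le k. \qquad (6)$$

Proof: First, it is easy to check that $\pi$ satisfies
$\frac{ds_i}{dt} = 0$ for $0 \le i \le k$. Conversely, based on the
condition $\frac{ds_i}{dt} = 0$ for all $i$, we have

$$-\pi_i + \frac{k-i+1}{k} \pi_{i-1} + \frac{i+1}{k} \pi_{i+1} = 0, \quad 1 \le i \le k-1,$$

$$-\pi_0 + \frac{1}{k} \pi_1 = 0,$$

$$-\pi_k + \frac{1}{k} \pi_{k-1} = 0.$$

By solving these equations, we get

$$\pi_i = \binom{k}{i} \pi_k, \quad \text{for } 0 \le i \le k.$$

Since $\sum_{i=0}^{k} \pi_i = 1$, the fixed point is derived as in
Equation (6). ∎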
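
The "easy to check" step of the proof can be confirmed with exact rational arithmetic. The sketch below plugs $\pi_i = \binom{k}{i}/2^k$ into the three balance equations and checks that every residual is exactly zero; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def theorem1_fixed_point(k):
    """Candidate fixed point pi_i = C(k, i) / 2^k as exact fractions."""
    return [Fraction(comb(k, i), 2 ** k) for i in range(k + 1)]


def balance_residuals(pi):
    """Residual of each balance equation (lambda factored out)."""
    k = len(pi) - 1
    residuals = []
    for i in range(k + 1):
        inflow = Fraction(0)
        if i >= 1:
            inflow += Fraction(k - i + 1, k) * pi[i - 1]
        if i <= k - 1:
            inflow += Fraction(i + 1, k) * pi[i + 1]
        residuals.append(inflow - pi[i])
    return residuals


pi_64 = theorem1_fixed_point(64)
residuals_64 = balance_residuals(pi_64)
```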

2.4 Summary 

We develop a stochastic Markov chain model to characterize the
I/O dynamics of a large-scale SSD system. Specifically, we approximate
the stochastic process with a deterministic process via the mean
field technique and identify the fixed point in the steady state. We
claim that the derivation is accurate when $N$ is large, as we can
formally prove that (i) the stochastic process converges to the
deterministic process as $N \to \infty$ and (ii) the deterministic
process specified by Equation (5) converges to the unique fixed point
$\pi$ as described in Equation (6). We refer readers to the Appendix
for the convergence proofs.

Our model enables us to analyze the tradeoff between cleaning
cost and wear-leveling of GC algorithms. As shown in §3, cleaning
cost and wear-leveling can be expressed as functions of $\pi$.

2.5 Extensions to General Workload 

Our model thus far focuses on the uniform workload, i.e., all
physical pages have the same probability of being accessed. For
completeness, we now generalize our model to allow for a general
workload, in which blocks/pages are accessed according to some
general probability distribution. We show how we apply the mean
field technique to approximate the I/O dynamics of an SSD, and
we also conduct simulations using synthetic workloads to validate
our approximation (see §5.1). As stated in §2.1, we focus on the
program and invalidate requests, both of which can change the state
of a block in the Markov chain model. We again assume that the
program/invalidate requests arrive as a Poisson process with rate
$\lambda$. In particular, to model the general workload, we let
$p_{i,j}$ be the probability that a type $i$ block transits to state
$j$ due to one program/invalidate request. We have

$$p_{i,j} = 0, \quad \text{if } j \ne i-1 \text{ and } j \ne i+1,$$

$$\sum_{i=0}^{k} \sum_{j=0}^{k} p_{i,j} \sum_{n=1}^{N} \mathbf{1}_{\{X_n(t)=i\}} = 1,$$

where $\mathbf{1}_{\{X_n(t)=i\}}$ indicates whether block $n$ is in
state $i$, and thus $\sum_{n} \mathbf{1}_{\{X_n(t)=i\}}$ represents
the number of blocks in state $i$. The second equation comes from the
fact that each program/invalidate request can only change the state of
one particular block.

In practice, $p_{i,j}$ (where $j = i-1$ or $j = i+1$) can be
estimated via workload traces. Specifically, for each request being
processed, one can count the number of blocks in state $i$ (i.e.,
$n_i$) and the number of blocks in state $i$ that change to state $j$
(i.e., $n_{i,j}$). Then $p_{i,j}$ can be estimated as:

$$p_{i,j} = \frac{\sum_{\text{for each request}} \frac{n_{i,j}}{n_i}}{\text{total number of requests}}, \qquad (7)$$

where $\frac{n_{i,j}}{n_i}$ is the probability that a block transits
from state $i$ to $j$ in a particular request, and $p_{i,j}$ is the
average over all requests.
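
The estimation procedure can be sketched in a few lines. The record layout below (one pair per request, holding the per-state block counts before the request and the observed transitions) is a hypothetical format assumed for illustration; the paper does not specify how trace statistics are stored.

```python
def estimate_transition_probs(samples):
    """Estimate p_{i,j} by averaging n_{i,j} / n_i over all requests.

    samples: list of (n, moved) pairs, one per request, where
      n[i]          = number of blocks in state i before the request
      moved[(i, j)] = number of state-i blocks that moved to state j
    """
    total_requests = len(samples)
    sums = {}
    for n, moved in samples:
        for (i, j), n_ij in moved.items():
            if n[i] > 0:
                sums[(i, j)] = sums.get((i, j), 0.0) + n_ij / n[i]
    return {key: value / total_requests for key, value in sums.items()}


# Two hypothetical requests on a tiny SSD with k = 2.
toy_samples = [
    ([2, 2, 0], {(0, 1): 1}),   # one of two state-0 blocks is programmed
    ([1, 3, 0], {(1, 0): 1}),   # one of three state-1 blocks is invalidated
]
p_hat = estimate_transition_probs(toy_samples)
```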

We can derive the occupancy measure $M^N(t)$ with a deterministic
process $s(t)$ specified by the following ODEs:

$$\frac{ds_i}{dt} = -\lambda (p_{i,i-1} + p_{i,i+1}) s_i + \lambda p_{i-1,i} s_{i-1} + \lambda p_{i+1,i} s_{i+1}, \quad 1 \le i \le k-1,$$

$$\frac{ds_0}{dt} = -\lambda p_{0,1} s_0 + \lambda p_{1,0} s_1, \qquad (8)$$

$$\frac{ds_k}{dt} = -\lambda p_{k,k-1} s_k + \lambda p_{k-1,k} s_{k-1}.$$

We can further derive the fixed point of the deterministic process
$s(t)$ as in Theorem 2. For the convergence proof, please refer to
the Appendix.

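Theorem 2. Equation (8) has a unique fixed point $\pi$ given by:

$$\pi_k = \frac{1}{1 + \sum_{i=0}^{k-1} \prod_{j=i}^{k-1} \frac{p_{j+1,j}}{p_{j,j+1}}},$$

$$\pi_i = \left( \prod_{j=i}^{k-1} \frac{p_{j+1,j}}{p_{j,j+1}} \right) \pi_k, \quad 0 \le i \le k-1. \qquad (9)$$

Proof: The derivation is similar to that of Theorem 1. ∎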
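
Setting the derivatives in Equation (8) to zero gives the balance condition $\pi_i\, p_{i,i+1} = \pi_{i+1}\, p_{i+1,i}$ between neighboring states, so the fixed point can be computed by a backward recursion from $\pi_k$ followed by normalization. The sketch below does this for arbitrary transition probabilities and checks that the uniform-workload values recover the binomial distribution of Theorem 1. The function name, the argument layout, and the test parameters are assumptions made for illustration, and all $p_{i,i+1}$ are assumed to be positive.

```python
from math import comb


def general_fixed_point(up, down):
    """Fixed point of the general-workload ODEs via neighbor balance.

    up[i]   = p_{i,i+1} for i = 0..k-1 (assumed positive)
    down[i] = p_{i,i-1} for i = 1..k   (down[0] is unused)
    """
    k = len(up)
    ratios = [1.0] * (k + 1)              # ratios[i] = pi_i / pi_k
    for i in range(k - 1, -1, -1):
        ratios[i] = ratios[i + 1] * down[i + 1] / up[i]
    total = sum(ratios)
    return [r / total for r in ratios]


# Uniform workload as a special case (hypothetical k = 6, N = 100).
k, num_blocks = 6, 100
up = [(k - i) / (num_blocks * k) for i in range(k)]
down = [i / (num_blocks * k) for i in range(k + 1)]
pi_general = general_fixed_point(up, down)
pi_binomial = [comb(k, i) / 2 ** k for i in range(k + 1)]
```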

3. DESIGN SPACE OF GC ALGORITHMS 

Using our developed stochastic model, we analyze how we can
parameterize a GC algorithm to adapt to different
performance-durability tradeoffs. In this section, we formally define
two metrics, namely cleaning cost and wear-leveling, for general GC
algorithms. Both metrics are defined based on the occupancy measure
$\pi$ which we derived in §2. We identify two optimal extremal points
in GC algorithms. Finally, we identify the optimal tradeoff curve
that explores the full optimal design space of GC algorithms.

3.1 Metrics 

We now define the parameters that are used to characterize
a family of GC algorithms. When a GC algorithm is executed, it
selects a block to reclaim. Let $w_i \ge 0$ (where $0 \le i \le k$)
denote the weight of selecting a particular type $i$ block (i.e., a
block with $i$ valid pages), such that the higher the weight $w_i$ is,
the more likely each type $i$ block is chosen to be reclaimed. The
weights are chosen with the following constraint:

$$\sum_{i=0}^{k} \frac{w_i}{N} \times N\pi_i = 1. \qquad (10)$$

The above constraint has the following physical meaning. The ratio
$w_i/N$ can be viewed as the probability of selecting a particular
type $i$ block for a GC operation. Since $N\pi_i$ is the total number
of type $i$ blocks in the system, $w_i \pi_i$ can be viewed as the
probability of selecting any type $i$ block for a GC operation. The
summation of $w_i \pi_i$ over all $i$ is equal to 1. Note that $\pi_i$
is the occupancy measure that we derive in §2.

We now define two metrics that respectively characterize the
performance and durability of a GC algorithm. The first metric is
called the cleaning cost, denoted by $C$, which is defined as the
average number of valid pages contained in the block that is selected
for a GC operation. In other words, the cleaning cost is the
average number of valid pages that need to be written to another
clean block during a GC operation. The cleaning cost reflects the
performance of a GC algorithm, such that a high-performance GC
algorithm should have a low cleaning cost. Formally, we have

$$C = \sum_{i=0}^{k} i w_i \pi_i. \qquad (11)$$



The second metric is called the wear-leveling, denoted by $W$,
which reflects how evenly the blocks are erased by a GC
algorithm. To improve the durability of an SSD, each block should
have approximately the same number of erasures. We use the concept
of the fairness index [29] to define the degree of wear-leveling
$W$, such that the higher $W$ is, the more evenly the blocks are
erased. Formally, we have

$$W = \frac{\left( \sum_{i=0}^{k} \frac{w_i}{N} \times N\pi_i \right)^2}{N \sum_{i=0}^{k} \left( \frac{w_i}{N} \right)^2 \times N\pi_i} = \left( \sum_{i=0}^{k} w_i^2 \pi_i \right)^{-1}. \qquad (12)$$

Note that the rationale of Equation (12) comes from the fact that
$\frac{w_i}{N}$ is the probability of selecting a particular type $i$
block, and there are $N\pi_i$ type $i$ blocks in total. For example,
if all $w_i$'s are equal to one, which implies that each block has the
same probability $\frac{1}{N}$ of being selected, then the
wear-leveling index $W$ achieves its maximum value equal to one as
$\sum_{i=0}^{k} \pi_i = 1$.
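
Both metrics are simple weighted sums, as the sketch below illustrates. It uses the cleaning cost $\sum_i i\, w_i \pi_i$ and the wear-leveling index in the simplified form $\left(\sum_i w_i^2 \pi_i\right)^{-1}$, which follows from applying the fairness index to per-block selection probabilities $w_i/N$ with $N\pi_i$ blocks of each type. The function names and the choice $k = 8$ are illustrative assumptions.

```python
from math import comb


def cleaning_cost(w, pi):
    """Average number of valid pages in the block selected for GC."""
    return sum(i * wi * p for i, (wi, p) in enumerate(zip(w, pi)))


def wear_leveling(w, pi):
    """Fairness-index form of wear-leveling: 1 / sum_i w_i^2 pi_i."""
    return 1.0 / sum(wi * wi * p for wi, p in zip(w, pi))


def is_valid_weighting(w, pi, tol=1e-9):
    """Check the constraint sum_i w_i pi_i = 1 with w_i >= 0."""
    return all(wi >= 0 for wi in w) and abs(
        sum(wi * p for wi, p in zip(w, pi)) - 1.0) < tol


# Equal weights (every block equally likely) under the uniform workload.
k = 8
pi = [comb(k, i) / 2 ** k for i in range(k + 1)]
w_equal = [1.0] * (k + 1)
cost_equal = cleaning_cost(w_equal, pi)
wear_equal = wear_leveling(w_equal, pi)
```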

The set of $w_i$'s, where $0 \le i \le k$, will be our selection
parameters to design a GC algorithm. In the following, we show how we
select the $w_i$'s for different GC algorithms subject to different
tradeoffs between cleaning cost and wear-leveling. Our results are
derived for a general workload subject to the system state
distribution $\pi$. We also derive the closed-form solutions under the
uniform workload as a case study.

3.2 GC Algorithm to Maximize Wear-leveling 

Suppose that our goal is to find a set of weights $w_i$ such that a
GC algorithm maximizes wear-leveling $W$. We can formulate the
following optimization problem:

$$\max_{w_i} \; W = \left( \sum_{i=0}^{k} w_i^2 \pi_i \right)^{-1} \qquad (13)$$

$$\text{s.t.} \quad \sum_{i=0}^{k} w_i \pi_i = 1,$$

$$w_i \ge 0.$$

The solution of the above optimization problem is to set $w_i = 1$
for all $i$, and the corresponding wear-leveling $W$ is equal to 1.
Note that $W \le 1$ as
$\sum_{i=0}^{k} w_i^2 \pi_i - \left( \sum_{i=0}^{k} w_i \pi_i \right)^2 = \sum_{i=0}^{k} w_i^2 \pi_i - 1 \ge 0$,
so the above solution is the optimal solution. The corresponding
cleaning cost is $\sum_{i=0}^{k} i \pi_i$. In other words, each block
has the same probability (i.e., $1/N$) of being selected for GC.
Intuitively, this assignment strategy which maximizes wear-leveling is
the random algorithm, in which each block is uniformly chosen
independent of its number of valid pages.

Under the uniform workload, we can compute the closed-form
solution of the cleaning cost $C$ as:

$$C = \sum_{i=0}^{k} i \pi_i = \sum_{i=0}^{k} i \frac{\binom{k}{i}}{2^k} = \frac{k}{2}.$$

It implies that a random GC algorithm introduces an average of $k/2$
additional page writes per GC operation under the uniform workload.

3.3 GC Algorithm to Minimize Cleaning Cost 

Suppose now that our goal is to find a set of weights $w_i$ to
minimize the cleaning cost $C$, or equivalently, minimize the number
of writes of valid pages during GC. The optimization formulation is:

$$\min_{w_i} \; C = \sum_{i=0}^{k} i w_i \pi_i \qquad (14)$$

$$\text{s.t.} \quad \sum_{i=0}^{k} w_i \pi_i = 1,$$

$$w_i \ge 0.$$

The solution of the above optimization problem is to set
$w_0 = 1/\pi_0$ and $w_i = 0$ for all $i > 0$ (assuming that there
exist some blocks of type 0), and the cleaning cost $C$ is equal to 0.
Since $C \ge 0$, and it is equal to 0 when $w_i = 0$ for all $i > 0$,
the solution is optimal. The corresponding wear-leveling $W$ is
$\pi_0$. Intuitively, this assignment strategy corresponds to the
greedy algorithm, which always chooses the block that has the minimum
number of valid pages for GC.

Under the uniform workload, the closed-form solution of $W$
corresponding to the minimum cost is given by:

$$W = \pi_0 = \frac{1}{2^k}.$$

The result shows that the greedy algorithm can significantly degrade
wear-leveling. For today's commonly used SSDs, the typical value of
$k$ is 64 or 128. This implies that the degree of wear-leveling
$W \approx 0$, and the durability of the SSD suffers.
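
To see how small the greedy wear-leveling index becomes, the sketch below evaluates the greedy weights ($w_0 = 1/\pi_0$, all other weights zero) exactly under the uniform workload, using the cleaning cost $\sum_i i\, w_i \pi_i$ and the wear-leveling index $\left(\sum_i w_i^2 \pi_i\right)^{-1}$. The function name is an illustrative assumption.

```python
from fractions import Fraction
from math import comb


def greedy_metrics(k):
    """Exact (cleaning cost, wear-leveling) of the greedy weighting
    under the uniform workload, where pi_i = C(k, i) / 2^k."""
    pi = [Fraction(comb(k, i), 2 ** k) for i in range(k + 1)]
    w = [1 / pi[0]] + [Fraction(0)] * k   # select only type-0 blocks
    assert sum(wi * p for wi, p in zip(w, pi)) == 1  # constraint holds
    cost = sum(i * wi * p for i, (wi, p) in enumerate(zip(w, pi)))
    wear = 1 / sum(wi * wi * p for wi, p in zip(w, pi))
    return cost, wear


cost_64, wear_64 = greedy_metrics(64)
```

For $k = 64$ the index equals $2^{-64}$, which is roughly $5.4 \times 10^{-20}$.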

3.4 Exploring the Full Optimal Design Space 

We identify two GC algorithms, namely the random and greedy
algorithms, that correspond to two optimal extremal points of all
GC algorithms. We now characterize the tradeoff between cleaning
cost and wear-leveling, and identify the full optimal design space
of GC algorithms. Specifically, we formulate an optimization problem:
given a cleaning cost $C^*$, what is the maximum wear-leveling
that a GC algorithm can achieve? Formally, we express the problem
(with respect to the $w_i$'s) as follows:

$$\max_{w_i} \; W = \left( \sum_{i=0}^{k} w_i^2 \pi_i \right)^{-1} \qquad (15)$$

$$\text{s.t.} \quad \sum_{i=0}^{k} w_i \pi_i = 1,$$

$$\sum_{i=0}^{k} i w_i \pi_i = C^*,$$

$$w_i \ge 0.$$

Without loss of generality, we assume that $\pi_i > 0$
($0 \le i \le k$). The solution of the optimization problem is stated
in the following theorem.

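Theorem 3. Given a cleaning cost $C^*$, the maximum wear-leveling
$W^*$ is given by:

$$W^* = \begin{cases}
\pi_0, & C^* = 0, \\
\text{[closed-form expression illegible in the source]}, & 0 < C^* < \sum_{i=0}^{k} i\pi_i, \\
1, & C^* = \sum_{i=0}^{k} i\pi_i, \\
\text{[closed-form expression illegible in the source]}, & \sum_{i=0}^{k} i\pi_i < C^* < k, \\
\pi_k, & C^* = k,
\end{cases} \qquad (16)$$

where the two interior cases are expressed in terms of four constants.

Proof: The proof is in the Appendix, where we also derive the four
constants. ∎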

4. RANDOMIZED GREEDY ALGORITHM 

In this section, we present a tunable GC algorithm called the 
randomized greedy algorithm (RGA) that can operate at any given 
cleaning cost C* and return the corresponding optimal wear-leveling 
W* ; or equivalently, RGA can operate at any point along the opti- 
mal tradeoff curve of C* and W* . 

4.1 Algorithm Details 

Algorithm 1 shows the pseudo-code of RGA, which operates as
follows. Each time GC is triggered, RGA randomly chooses
$d$ out of $N$ blocks $b_1, b_2, \ldots, b_d$ as candidates (Step 2).
Let $v(b_i)$ denote the number of valid pages of block $b_i$. Then RGA
selects the block $b^*$ that has the smallest number of valid pages,
i.e., the minimum $v(\cdot)$, to reclaim (Step 3). RGA then writes
the valid pages of $b^*$ to another clean block and erases $b^*$
(Steps 4-5). In essence, we define a selection window of window size
$d$ that defines a random subset of $d$ out of $N$ blocks to be
selected. The window size $d$ is the tunable parameter that enables us
to choose between the random and greedy policies. Intuitively, the
random selection of $d$ blocks allows us to maximize wear-leveling,
while the greedy selection within the selection window allows us to
minimize the cleaning cost. Note that in the special cases where
$d = 1$ (resp. $d \to \infty$), RGA corresponds to the random (resp.
greedy) algorithm.

Algorithm 1 Randomized Greedy Algorithm (RGA) 

1: if garbage collection is triggered then
2:   randomly choose d blocks b_1, b_2, ..., b_d;
3:   find block b* = argmin_{b_i ∈ {b_1, b_2, ..., b_d}} v(b_i);
4:   write all valid pages in b* to another clean block;
5:   erase b*;
6: end if
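
Steps 2 and 3 of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the controller implementation: the function name `rga_select`, the list-of-counts representation, and the injectable random generator are all assumptions. `random.sample` draws the $d$ candidates without replacement.

```python
import random


def rga_select(valid_counts, d, rng=random):
    """Pick the block to reclaim: sample d candidate blocks uniformly
    without replacement, then take the one with the fewest valid pages.

    valid_counts[b] is v(b), the number of valid pages in block b.
    """
    if not 1 <= d <= len(valid_counts):
        raise ValueError("window size d must lie in 1..N")
    candidates = rng.sample(range(len(valid_counts)), d)
    return min(candidates, key=lambda b: valid_counts[b])


# Hypothetical SSD with N = 5 blocks.
valid_counts = [5, 3, 7, 0, 2]
rng = random.Random(0)
victim_greedy = rga_select(valid_counts, d=5, rng=rng)   # window = all blocks
victim_random = rga_select(valid_counts, d=1, rng=rng)   # window = one block
```

With $d = N$ the window covers every block, so the choice coincides with the greedy algorithm; with $d = 1$ the choice is uniformly random.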



4.2 Performance Analysis of RGA 

We now derive the cleaning cost and wear-leveling of RGA. We
first determine the values of the weights $w_i$ for all $i$. Recall
from §3.1 that $w_i \pi_i$ represents the probability of choosing any
block of type $i$ for GC. In RGA, a type $i$ block is chosen for GC if
and only if the randomly chosen $d$ blocks all contain at least $i$
valid pages and at least one of them contains exactly $i$ valid pages.
Thus, the corresponding probability $w_i \pi_i$ is
$\left(\sum_{j=i}^{k} \pi_j\right)^d - \left(\sum_{j=i+1}^{k} \pi_j\right)^d$.
Note that this expression assumes that the $d$ blocks are chosen
uniformly at random from the $N$ blocks with replacement, while in
RGA, these $d$ blocks are chosen uniformly at random without
replacement. However, we can still use it as an approximation since
$d$ is much smaller than $N$ for a large-scale SSD. Therefore, we have

$$w_i = \frac{\left(\sum_{j=i}^{k} \pi_j\right)^d - \left(\sum_{j=i+1}^{k} \pi_j\right)^d}{\pi_i}. \qquad (17)$$

Based on the definitions of cleaning cost $C$ in Equation (11) and
wear-leveling $W$ in Equation (12), we can derive $C$ and $W$:

$$C = \sum_{i=0}^{k} i \left[ \left(\sum_{j=i}^{k} \pi_j\right)^d - \left(\sum_{j=i+1}^{k} \pi_j\right)^d \right], \qquad (18)$$

$$W = \left( \sum_{i=0}^{k} \frac{1}{\pi_i} \left[ \left(\sum_{j=i}^{k} \pi_j\right)^d - \left(\sum_{j=i+1}^{k} \pi_j\right)^d \right]^2 \right)^{-1}. \qquad (19)$$

In §5, we will show the relationship between cleaning cost $C$ and
wear-leveling $W$ based on RGA. We show that RGA almost lies on
the optimal tradeoff curve of $C$ and $W$.
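
The RGA weights and the resulting metrics are straightforward to evaluate numerically. The sketch below computes $w_i \pi_i$ as the difference of two tail probabilities raised to the power $d$ (the with-replacement approximation described above), then evaluates the cleaning cost $\sum_i i\, w_i \pi_i$ and the wear-leveling index $\left(\sum_i w_i^2 \pi_i\right)^{-1}$. The function names, $k = 8$, and the window sizes are assumptions for illustration, and every $\pi_i$ is assumed positive.

```python
from math import comb


def rga_weights(pi, d):
    """Weights w_i of RGA with window size d (requires pi_i > 0)."""
    k = len(pi) - 1
    tail = [0.0] * (k + 2)            # tail[i] = sum of pi_j for j >= i
    for i in range(k, -1, -1):
        tail[i] = tail[i + 1] + pi[i]
    return [(tail[i] ** d - tail[i + 1] ** d) / pi[i] for i in range(k + 1)]


def rga_metrics(pi, d):
    """Return (cleaning cost, wear-leveling) of RGA for window size d."""
    w = rga_weights(pi, d)
    cost = sum(i * wi * p for i, (wi, p) in enumerate(zip(w, pi)))
    wear = 1.0 / sum(wi * wi * p for wi, p in zip(w, pi))
    return cost, wear


k = 8
pi_uniform = [comb(k, i) / 2 ** k for i in range(k + 1)]
cost_d1, wear_d1 = rga_metrics(pi_uniform, d=1)   # random algorithm
cost_d4, wear_d4 = rga_metrics(pi_uniform, d=4)
```

Increasing $d$ lowers the cleaning cost at the expense of wear-leveling, which is the tradeoff the window size is meant to expose.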

4.3 Deployment of RGA 

We now highlight the practical implications when RGA is deployed.
RGA is implemented in the SSD controller as a GC algorithm. From our
evaluation (see details in §5), a small value of $d$
(which is significantly less than the number of blocks $N$) suffices
to make RGA lie on the optimal tradeoff curve. This allows RGA
to incur low RAM usage and low computational overhead. Specifically,
RGA only needs to load the meta information (e.g., number
of valid pages) of $d$ blocks into RAM for comparison. With a small
value of $d$, RGA consumes only a small amount of RAM space.
Also, RGA only needs to compare $d$ blocks to select the block
with the minimum number of valid pages for GC. The computational
cost is $O(d)$ and hence very small as well. Since a practical
SSD controller typically has limited RAM space and computational
power, we expect that RGA addresses the practical needs and can
be readily deployed.

We expect that RGA, like other GC algorithms, is only executed
periodically or when the number of free blocks drops below a
predefined threshold. The window size $d$ can be tuned at different
times during the lifespan of the SSD to achieve different levels of
wear-leveling and cleaning cost along the optimal tradeoff curve.
In particular, we emphasize that the window size $d$ can be chosen
as a non-integer. In this case, we can simply linearly interpolate $d$
between $\lfloor d \rfloor$ and $\lfloor d+1 \rfloor$. Formally, for a
given non-integer value $d$, when GC is triggered, RGA can set the
window size as $\lfloor d \rfloor$ with probability $p$ and set the
window size as $\lfloor d+1 \rfloor$ with probability $1-p$, where $p$
is given by:

$$d = p \lfloor d \rfloor + (1-p) \lfloor d+1 \rfloor. \qquad (20)$$

Thus, we can evaluate the values of the $w_i$'s as follows:

$$w_i(d) = p\, w_i(\lfloor d \rfloor) + (1-p)\, w_i(\lfloor d+1 \rfloor),$$

based on Equation (17). The cleaning cost and wear-leveling of
RGA can be computed accordingly via Equations (11) and (12) by
substituting $w_i(d)$. More generally, we can obtain the window size
from some probability distribution with the mean value given by
$d$. This enables us to operate at any point of the optimal tradeoff
curve.
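
The randomized choice of window size for a non-integer $d$ can be sketched as follows. Solving $d = p\lfloor d \rfloor + (1-p)\lfloor d+1 \rfloor$ for $p$ gives $p = \lfloor d+1 \rfloor - d$, since the two candidate sizes differ by one. The function names and the sample value $d = 2.25$ are illustrative assumptions.

```python
import math
import random


def window_mix(d):
    """Return (low, high, p): use window size `low` with probability p
    and `high` with probability 1 - p, so the mean window size is d."""
    low = math.floor(d)
    high = low + 1                    # floor(d + 1) = floor(d) + 1
    p = high - d
    return low, high, p


def sample_window_size(d, rng=random):
    """Draw the integer window size to use for one GC invocation."""
    low, high, p = window_mix(d)
    return low if rng.random() < p else high


low, high, p = window_mix(2.25)
mean_window = p * low + (1 - p) * high
```

For an integer $d$ the mix degenerates to $p = 1$, so the window size is always $d$.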



5. MODEL VALIDATION 

We have thus far formulated an analytical model that characterizes the
I/O dynamics of an SSD, and further proposed RGA, which can be
tuned to realize different performance-durability tradeoffs. In this
section, we validate our theoretical results developed in prior
sections. First, we validate via simulation that our system state
derivations in Theorem 2 provide an accurate approximation even for a
general workload. Also, we validate that RGA operates along the
optimal tradeoff curve characterized in Theorem 3.

5.1 Validation on Fixed-Point Derivations 

Recall from §2 that we derive, via the mean field analysis, the
fixed point $\pi$ for the system state of our model under both uniform
and general workloads. We now validate the accuracy of this
derivation. We use the DiskSim simulator [8] with SSD extensions [1].
We generate synthetic workloads for different read/write
patterns to drive our simulations, and compare the system state
obtained by each simulation with that of our model.

We feed the simulations with three different types of synthetic workloads: (1) Random, (2) Sequential, and (3) Hybrid. Specifically, Random means that the starting address of each I/O request is uniformly distributed in the logical address space. Note that its definition is (slightly) different from that of the uniform workload used in our model, as the latter directly considers the requests in the physical address space. The logical-to-physical address mapping will be determined by the simulator. Sequential means that each request starts at the address which immediately follows the last address accessed by the previous request. Hybrid assumes that there are 50% of Random requests and 50% of Sequential requests. Furthermore, for each synthetic workload, we consider both Poisson and non-Poisson arrivals. For the former, we assume that the inter-arrival time of requests follows an exponential distribution with mean 100ms; for the latter, we assume that the inter-arrival time of requests follows a normal distribution (denoted by $N(\mu, \sigma^2)$) with mean $\mu$ = 100ms and standard deviation $\sigma$ = 10ms.
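The three synthetic workloads and the two arrival processes described above can be sketched as a small generator. This is only an illustration under stated assumptions: every request touches a single page, Hybrid is realized by flipping a fair coin per request, and negative normal samples are clipped to zero. The function name `generate_workload` and its parameters are hypothetical and are not part of DiskSim.

```python
import random

def generate_workload(kind, n_requests, n_pages, poisson=True, seed=0):
    """Return a list of (arrival_time_ms, start_page) tuples.

    kind: "random", "sequential", or "hybrid" (each request random with prob. 0.5).
    """
    rng = random.Random(seed)
    requests = []
    now = 0.0
    next_page = 0  # page that immediately follows the previous request
    for _ in range(n_requests):
        if poisson:
            gap = rng.expovariate(1.0 / 100.0)      # exponential, mean 100 ms
        else:
            gap = max(0.0, rng.gauss(100.0, 10.0))  # N(100, 10^2), clipped at 0
        now += gap
        is_random = kind == "random" or (kind == "hybrid" and rng.random() < 0.5)
        page = rng.randrange(n_pages) if is_random else next_page
        requests.append((now, page))
        next_page = (page + 1) % n_pages  # assumes one-page requests
    return requests

sequential = generate_workload("sequential", 100, 1000)
```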

Using simulations, we generate 10M requests for each workload and feed them to a small-scale SSD that contains 8 flash packages with 160 blocks each. We consider a small-scale SSD (i.e., with a small number of blocks) to make the SSD converge to an equilibrium state quickly with a sufficient number of requests; in §6 we consider a larger-size SSD. After running all 10M requests, we obtain the system state of the SSD for each workload from our simulation results. On the other hand, using our model, we first execute the workload and record the transition probabilities $p_{i,j}$'s based on Equation (10). We then compute the system state $\pi$ using Theorem 2 for a general workload (which covers the uniform workload as well). We then compare the system states obtained from both the simulations and model derivations.
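As a rough illustration of the model side of this comparison, the sketch below computes the stationary distribution of a birth-death chain from recorded per-state transition probabilities. It relies on the characterization in Appendix A that the fixed point satisfies $\pi Q = 0$ for a birth-death rate matrix; the closed form of Theorem 2 is not reproduced here. The function name and the example probabilities (a uniform workload with $p_{i,i+1} = (k-i)/k$ and $p_{i,i-1} = i/k$) are assumptions made for illustration.

```python
from math import comb

def birth_death_fixed_point(p_up, p_down):
    """Stationary distribution of a birth-death chain on states 0..k.

    p_up[i]   : probability (or rate) of moving from state i to i+1, i = 0..k-1
    p_down[i] : probability (or rate) of moving from state i to i-1, i = 1..k
                (p_down[0] is unused)
    """
    k = len(p_up)
    weights = [1.0]
    for i in range(1, k + 1):
        # Detailed balance: pi_i * p_down[i] = pi_{i-1} * p_up[i-1]
        weights.append(weights[-1] * p_up[i - 1] / p_down[i])
    total = sum(weights)
    return [w / total for w in weights]

k = 8
p_up = [(k - i) / k for i in range(k)]    # assumed uniform workload: program a page
p_down = [i / k for i in range(k + 1)]    # assumed uniform workload: invalidate a page
pi = birth_death_fixed_point(p_up, p_down)
```

For these illustrative probabilities the result is the binomial distribution with parameters $k$ and 1/2.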

Figure 2 shows the simulation and model results for the Random, Sequential, and Hybrid workloads, each associated with either the Poisson or non-Poisson arrivals of requests. The results show that under different synthetic workloads, our model derived from the mean field technique can still provide good approximations of the system state compared with that obtained from the simulations. Note that we also observe good approximations even for non-Poisson arrivals of requests. The results show the robustness of our model in evaluating the system state.

5.2 Validation on Operational Points of RGA 

In §3, we characterize the optimal tradeoff curve between cleaning cost and wear-leveling; in §4, we present a GC algorithm called RGA that can be tuned by a parameter $d$ to adjust the tradeoff between cleaning cost and wear-leveling. We now validate that RGA can indeed be tuned to operate on the optimal tradeoff curve.

We consider different system state distributions $\pi$ to study the performance of RGA. We first consider $\pi$ derived for the uniform workload (i.e., Equation (6)). We also consider three different distributions of $\pi$ that are drawn from truncated normal distributions, denoted by $N(\mu, \sigma^2)$ with mean $\mu$ and standard deviation $\sigma$. Figure 3(a) illustrates the four system state distributions, where the mean and variance of each truncated normal distribution are shown in the figure.

Figure 2: Model validation on the system state $\pi$. In each sub-figure, the x-axis represents the states (i.e., the number of valid pages in a block), and the y-axis indicates the state probabilities.

Figure 3: Full design space and the performance of RGA.

For each system state distribution, we compute the maximum wear-leveling $W^*$ for each cleaning cost $C^*$ based on Theorem 3. Also, we evaluate the performance of RGA by varying the window size $d$ from 1 to 100, and obtain the corresponding cleaning cost and wear-leveling based on Equations (13) and (19). Here, we only focus on the integer values of $d$.

Figure 3(b) shows the results, in which the four curves represent the optimal tradeoff curves corresponding to the four different distributions of $\pi$, while the circles correspond to the operational points of RGA with different integer values of window size $d$ from 1 to 100. Note that the maximum wear-leveling corresponds to RGA with window size $d = 1$ (i.e., the random algorithm). As the window size increases, the wear-leveling decreases, while the cleaning cost also decreases. We observe that RGA indeed operates along the optimal tradeoff curves with regard to different system state distributions.
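The operational points of RGA for integer window sizes can be approximated numerically. The sketch below is not the paper's Equation (17), which is not reproduced in this section. It assumes that RGA reclaims the block with the fewest valid pages among $d$ sampled blocks, that the types of the sampled blocks are independent draws from $\pi$ (a large-system approximation), and that wear-leveling has the fairness-index form $(\sum_i w_i \pi_i)^2 / \sum_i w_i^2 \pi_i$ suggested by the KKT conditions in Appendix B. All function names are hypothetical.

```python
from math import comb

def rga_selection_probs(pi, d):
    """Probability q_i that the victim has i valid pages when the block with the
    fewest valid pages among d independently sampled blocks is reclaimed."""
    k = len(pi) - 1
    tail = [0.0] * (k + 2)           # tail[i] = sum_{j >= i} pi_j
    for i in range(k, -1, -1):
        tail[i] = tail[i + 1] + pi[i]
    return [tail[i] ** d - tail[i + 1] ** d for i in range(k + 1)]

def operating_point(pi, d):
    """Return (cleaning cost, wear-leveling) for window size d."""
    q = rga_selection_probs(pi, d)
    cost = sum(i * qi for i, qi in enumerate(q))
    # With w_i = q_i / pi_i: wear-leveling = (sum w_i pi_i)^2 / sum w_i^2 pi_i
    wear = sum(q) ** 2 / sum(qi * qi / p for qi, p in zip(q, pi) if p > 0)
    return cost, wear

k = 16  # small block size chosen only to keep the example numerically simple
pi = [comb(k, i) / 2 ** k for i in range(k + 1)]
points = [operating_point(pi, d) for d in range(1, 101)]
```

Under these assumptions, $d = 1$ reproduces the random algorithm (wear-leveling equal to 1, cleaning cost equal to the mean of $\pi$), and the cleaning cost shrinks as $d$ grows.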

It is important to note that we can realize non-integer window sizes to further fine-tune RGA along the optimal tradeoff curve (see §4.3). To validate, we consider different values of $d$ from 1 to 2, with step size 0.05, and realize each value of $d$ via linear interpolation between window sizes 1 and 2.

Figure 3(c) shows the results for non-integer $d$ using different system state distributions. Here, we zoom into the wear-leveling values from 0.75 to 1. Each star corresponds to the RGA with a non-integer window size obtained by Equation (20). We observe that RGA can be further fine-tuned to operate along the optimal tradeoff curves even when $d$ is a non-integer.

6. TRACE-DRIVEN EVALUATION 

In this section, we evaluate the performance of RGA under more 
realistic settings. Since today's SSD controllers are mainly pro- 
prietary firmware, it is non-trivial to implement GC algorithms in- 
side a real-world SSD controller. Thus, similar to ^5] we conduct 
our evaluation using the DiskSim simulator [8] with SSD extensions [1].
This time we focus on a large-scale SSD. We consider
several real-world traces, and evaluate different metrics, including
cleaning cost, I/O throughput, wear-leveling, and durability, for dif- 
ferent GC algorithms. Note that the cleaning cost and wear-leveling 
are the metrics considered in the model, while the I/O throughput 
and durability are the metrics related to user experience. 

Using trace-driven evaluation, our goal is to demonstrate the ef- 
fectiveness of RGA in practical deployment. We compare different 
variants of RGA with regard to different values of window size d, 
as well as the random and greedy algorithms. We emphasize that 
we are not advocating a particular value of d for RGA in real de- 
ployment; instead, we show how different values of d can be tuned 
along the performance-durability tradeoff. 

6.1 Datasets 

We first describe the datasets that drive our evaluation. Since 
the read requests do not influence our analysis, we focus on four 
real-world traces that are all write-intensive: 

• Financial [43]: It is an I/O trace collected from an online transaction processing application running at a large financial institution. There are two financial traces in [43], namely Financial1.spc and Financial2.spc. Since Financial2.spc is read-dominant, we only use Financial1.spc in this paper.

• Webmail [45]: It is an I/O trace that describes the webmail workload of a university department mail server.

• Online [45]: It is an I/O trace that describes the coursework management workload on Moodle at a university.

• Webmail+Online [45]: It is the combination of the I/O traces of Webmail and Online.

Table 1 summarizes the statistics of the traces. The original Financial trace in [43] contains 24 application-specific units (ASUs) of a storage server (denoted by ASU0 to ASU23). We study the traces of all ASUs except ASU1, ASU3, and ASU5, whose maximum logical sector numbers go beyond the logical address space in our configured SSD (see §6.2). The remaining Financial trace contains around 4.4 million I/O requests, in which 77.82% are write requests and the remaining are read requests. Also, 1.67% of I/O requests are sequential requests, each of which has its starting address immediately following the last address of its prior request. The average size of each request is 5.4819KB, meaning that most requests only access one page as the size of one page is configured as 4KB in the simulation. The average inter-arrival time of two consecutive requests is just around 10 ms. On the other hand, for the Webmail, Online and Webmail+Online traces obtained from [45], the write requests account for around 80% of I/O requests, and over 70% of I/O requests are sequential requests. Moreover, all requests in those traces have size 4KB (i.e., only one page is accessed in each request), and the average inter-arrival time is much longer than that of the Financial trace. In summary, the Financial trace has the random-write-dominant access pattern, while the Webmail, Online, and Webmail+Online traces have the sequential-write-dominant access pattern.

We set the page size of an SSD as 4KB (the default value in most of today's SSDs). Since the block size considered by these traces is 512 bytes, we align the I/O requests of these traces to be multiples of the 4KB page size. To enable us to evaluate different GC algorithms, we need to make the blocks in an SSD undergo a sufficient number of program-erase cycles. However, these traces may not be long enough to trigger enough block erasures. Thus, we propose to replay a trace; that is, in each replay cycle, we make a copy of the original trace without changing its I/O patterns, while we only change the arrival times of the requests by adding a constant value. In our simulations, we replay the traces multiple times so that each trace file contains around 50M I/O requests. Since we replay a trace, we issue the same write request to a page multiple times, and this keeps invalidating pages due to out-of-place writes. Thus, many GC operations will be triggered, and this enables us to stress-test the cleaning cost and wear-leveling metrics. We point out that this replay approach has also been used in the prior SSD work [38].
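The replay step can be sketched as below. The tuple layout of a trace record and the choice of the per-cycle time shift (trace duration plus one average gap, so that cycles do not overlap) are assumptions for illustration; the text only states that a constant value is added to the arrival times in each cycle.

```python
def replay_trace(trace, n_cycles):
    """Concatenate n_cycles copies of a trace, shifting arrival times only.

    trace: list of (arrival_time, op, address, size) tuples sorted by time.
    """
    if not trace:
        return []
    duration = trace[-1][0] - trace[0][0]
    avg_gap = duration / max(len(trace) - 1, 1)
    shift = duration + avg_gap  # illustrative constant added per replay cycle
    replayed = []
    for cycle in range(n_cycles):
        offset = cycle * shift
        # Operation type, address, and size are copied unchanged.
        replayed.extend((t + offset, op, addr, size) for t, op, addr, size in trace)
    return replayed

trace = [(0.0, "W", 10, 4096), (5.0, "W", 20, 4096), (10.0, "R", 10, 4096)]
long_trace = replay_trace(trace, 3)
```

Because addresses repeat across cycles, every replayed write hits a page that already holds valid data, which is what keeps invalidating pages and triggering GC.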

6.2 System Configuration 

Table 2 summarizes the parameters that we use to configure an SSD in our evaluation. We use the default configurations from the simulator whose parameters are based on a common SLC SSD [13]. Specifically, the SSD contains 8 flash packages, each of which has its own control bus and data bus, so they can process I/O requests in parallel. Each flash package contains 8 planes containing 2048 blocks each. Each block contains 64 pages of size 4KB each. Therefore, each flash package contains 16384 physical blocks in total and the physical capacity of the SSD is 32GB. For the timing parameters, the time to read one page from the flash media to the register in the plane is 25 µs, and the time of programming one page from the register in the plane to the flash media is 0.2ms. For an erase operation, it takes 1.5ms to erase one block. The time of transferring one byte through the data bus line is 0.025 µs. Since an SSD is usually over-provisioned, we set the over-provisioning factor as 15%, which means that the advertised capacity of an SSD is only 85% of the physical capacity. Moreover, we set the threshold of triggering GC as 5%, meaning that GC will be triggered when the fraction of free blocks in the system is smaller than 5%. Since flash packages are independent in processing I/O requests, GC is also triggered independently in each flash package. In the following, we only focus on a single flash package and compare the performance of different GC algorithms.
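The capacity figures above follow directly from the configured geometry. The short calculation below rederives them; the constant names are illustrative and do not correspond to DiskSim parameter names.

```python
PAGE_SIZE_KB = 4
PAGES_PER_BLOCK = 64
BLOCKS_PER_PLANE = 2048
PLANES_PER_PACKAGE = 8
PACKAGES_PER_SSD = 8
OVER_PROVISIONING = 0.15
GC_THRESHOLD = 0.05

blocks_per_package = PLANES_PER_PACKAGE * BLOCKS_PER_PLANE      # 16384 blocks
package_capacity_gb = (
    blocks_per_package * PAGES_PER_BLOCK * PAGE_SIZE_KB / 1024 ** 2
)                                                                # 4 GB per package
physical_capacity_gb = package_capacity_gb * PACKAGES_PER_SSD    # 32 GB
advertised_capacity_gb = physical_capacity_gb * (1 - OVER_PROVISIONING)  # about 27.2 GB
# GC starts in a package once fewer than 5% of its blocks are free.
gc_trigger_blocks = int(blocks_per_package * GC_THRESHOLD)       # 819 blocks
```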

We consider two different initial states of an SSD before we start our simulations. The first one is the empty state, meaning that the SSD is entirely clean and no data has been stored. The second one is the full state, meaning the SSD is fully occupied with valid data and each logical address is always mapped to a physical page containing valid data. Thus, each write request to a (valid) page will trigger an update operation, which writes the new data to a clean page and invalidates the original page. Note that the full initial state is the default setting in the simulator. In most of our simulations (§6.3-§6.5), we use the full initial state as it can be viewed as "stress-testing" the I/O performance of an SSD. When we study the durability of SSDs (§6.6), we use the empty initial state as it can be viewed as the state of a brand-new SSD.

6.3 Cleaning Cost 

We first evaluate the cleaning cost of different GC algorithms. In particular, we execute the traces with each of the GC algorithms and record the total number of GC operations and the total number of valid pages which are written back due to GC. We then derive the cleaning cost as the average number of valid pages that are written back in each GC operation.

| Trace | Total # of requests | Write ratio | Sequential ratio | Avg. request size | Avg. inter-arrival time |
|---|---|---|---|---|---|
| Financial | 4.4 M | 0.7782 | 0.0167 | 5.4819 KB | 9.9886 ms |
| Webmail | 7.8 M | 0.8186 | 0.7868 | 4 KB | 222.118 ms |
| Online | 5.7 M | 0.7388 | 0.7373 | 4 KB | 303.763 ms |
| Webmail+Online | 13.5 M | 0.7849 | 0.7597 | 4 KB | 128.302 ms |

Table 1: Workload statistics of traces.

| Parameter | Value |
|---|---|
| page size | 4 KB |
| # of pages per block | 64 |
| # of blocks per package | 16384 |
| # of packages per SSD | 8 |
| SSD capacity | 32 GB |
| read one page | 0.025 ms |
| write one page | 0.2 ms |
| erase one block | 1.5 ms |
| transfer one byte | 0.000025 ms |
| over-provisioning | 15% |
| threshold of triggering GC | 5% |

Table 2: Configuration parameters.

Figure 4 shows the simulation results. In this figure, there are four groups of bars which correspond to the Financial, Webmail, Online, and Webmail+Online traces, respectively. In each group, there are seven bars which correspond to the greedy algorithm, random algorithm and RGA with different window sizes $d$. The vertical axis represents the cleaning cost that each GC algorithm incurs. In this simulation, the simulator starts from the full initial state. We can see that the greedy algorithm incurs the smallest cleaning cost that is almost 0, while the random algorithm has the highest cleaning cost that is close to the total number of pages in each block (i.e., $k = 64$). The intuition is that if the greedy algorithm is used, then for every GC operation, the block containing the smallest number of valid pages is reclaimed, which means that it only needs to read out and write back the smallest number of pages. Therefore, the cleaning cost of the greedy algorithm should be the smallest among all algorithms. Moreover, RGA provides a variable cleaning cost between the greedy and random algorithms.
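The way the window size moves the cleaning cost between the two extremes can be illustrated with a toy victim-selection routine. This sketch assumes that RGA samples $d$ distinct blocks uniformly at random and reclaims the one with the fewest valid pages, so that $d = 1$ behaves like the random algorithm and $d$ equal to the number of blocks behaves like the greedy algorithm. It works on a frozen snapshot of valid-page counts and is not a substitute for the DiskSim experiments; the function names are hypothetical.

```python
import random

def select_victim(valid_pages, d, rng):
    """Sample d distinct blocks uniformly at random and return the index of the
    sampled block with the fewest valid pages."""
    n = len(valid_pages)
    window = rng.sample(range(n), min(d, n))
    return min(window, key=lambda b: valid_pages[b])

def average_cleaning_cost(valid_pages, d, n_trials, seed=0):
    """Average number of valid pages that would be written back per GC operation
    on a frozen snapshot (the state is not updated between trials)."""
    rng = random.Random(seed)
    moved = sum(valid_pages[select_victim(valid_pages, d, rng)] for _ in range(n_trials))
    return moved / n_trials

snapshot = [5, 0, 64, 32, 7, 12]   # valid pages per block (illustrative values)
greedy_cost = average_cleaning_cost(snapshot, len(snapshot), 100)
random_cost = average_cleaning_cost(snapshot, 1, 20000)
```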

6.4 Impact on I/O Throughput

We now consider the impact of different GC algorithms on the 
I/O throughput, using the metric Input/Output Operations Per Sec- 
ond (IOPS). Note that IOPS is an indirect indicator of the clean- 
ing cost. Specifically, the higher the cleaning cost, the more pages 
needed to be moved in each GC operation. This prolongs the dura- 
tion of a GC operation, and leads to smaller IOPS as an I/O request 
must be queued for a longer time until a GC operation is finished. 

Figure 4: Cleaning cost of different GC algorithms.



Figure 5: IOPS of different GC algorithms. 

Figure 5 shows the IOPS results of different GC algorithms (note that the simulator starts from the full initial state). We can see that the greedy algorithm achieves the highest IOPS, and the random algorithm has the lowest IOPS, which is less than 5% of the IOPS achieved by the greedy algorithm. The results conform to those in Figure 4. This means that the cleaning cost, the metric that we use in our analytical model, correctly reflects the resulting I/O performance. Again, RGA can provide different I/O throughput results with different values of $d$.

6.5 Wear-Leveling 

We now evaluate the wear-leveling of different GC algorithms. In the simulation, we execute the traces with each of the GC algorithms and record the number of times that each block has been erased. We then estimate the probability that each block is chosen for GC and derive the wear-leveling based on its definition in Equation (12).
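The estimation step can be sketched as follows. Equation (12) is not reproduced in this section, so the sketch assumes that the wear-leveling metric has the form of a fairness index over the per-block selection probabilities, $(\sum_j p_j)^2 / (N \sum_j p_j^2)$, which equals 1 when all blocks are erased equally often. The function name `wear_leveling` is hypothetical.

```python
def wear_leveling(erase_counts):
    """Fairness-style wear-leveling index in (0, 1] from per-block erase counts."""
    total = sum(erase_counts)
    if total == 0:
        raise ValueError("no erasures recorded")
    n = len(erase_counts)
    # Estimated probability that each block is chosen for GC.
    probs = [c / total for c in erase_counts]
    return sum(probs) ** 2 / (n * sum(p * p for p in probs))

balanced = wear_leveling([10] * 8)       # every block erased equally often
skewed = wear_leveling([40, 0, 0, 0])    # a single block absorbs all erasures
```

Under this assumed definition, a perfectly balanced device scores 1, while a device whose erasures all hit one of $N$ blocks scores $1/N$.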

Figure 6 shows the wear-leveling results. It is clear that the random algorithm always achieves the maximum wear-leveling, which is almost one. This implies that the random algorithm can effectively balance the numbers of erasures across all blocks. On the other hand, the greedy algorithm achieves the minimum wear-leveling which is less than 0.2 for all traces. Here, we note that in all traces, our RGA realizes different levels of wear-leveling between the random and greedy algorithms with different values of $d$. In particular, when $d \le 2$, the wear-leveling of RGA is within 80% of the maximum wear-leveling of the random algorithm.

Figure 6: Wear-leveling of different GC algorithms.

6.6 Impact on Durability

The previous wear-leveling experiment provides insights into the durability (or lifetime) of an SSD. In this evaluation, we focus on examining how the durability of an SSD is affected by different GC algorithms.

To study the durability of an SSD, we have to make the SSD continue handling a sufficient number of I/O requests until it is worn out. In order to speed up our simulation, we decrease the maximum number of erasures sustainable by each block to 50. We also reduce the size of the SSD such that each flash package contains 4096 blocks, so the size of each flash package is 1GB. Other configurations are the same as we described in §6.2. Also, instead of using the real-world traces as in previous simulations, we drive the simulation with the synthetic traces that have more aggressive I/O rates so that the SSD is worn out soon. Specifically, we consider the same set of synthetic traces Random, Hybrid and Sequential as described in §5.1, but here we set the mean inter-arrival time of I/O requests to be 10ms (as opposed to 100ms in §5.1) based on Poisson arrivals.

Due to the use of bad block management [36], an SSD can allow a small percentage of bad (worn-out) blocks during its lifetime. Suppose that the SSD can allow up to e% of bad blocks for some parameter e. To derive the durability of the SSD, we first continue running each workload trace on the SSD until e% blocks are worn out, i.e., the erasure limit is reached. Then we record the length of time that the SSD survives, and take it as the durability of the SSD. For comparison, we normalize the durability with respect to that of the greedy algorithm (which is expected to have the minimum durability). In this experiment, we consider the case where e% = 5%, while we also verify that similar observations are made for other values of e% < 10%. Also, we assume that the SSD is brand-new (i.e., the initial state is empty) and all blocks have no erasure at the beginning.

Figure 7 shows the results. We observe that the durability results of different GC algorithms are consistent with those of wear-leveling in Figure 6. We observe that the random algorithm achieves the maximum durability, and the value can be almost six times over that of the greedy algorithm (e.g., in the Sequential workload). Again, RGA provides a tunable durability between the random and greedy algorithms. When the window size $d \le 5$, the durability of RGA can be within 68% of the maximum lifetime of the random algorithm for Random and Hybrid workloads. For the Sequential workload, the durability of RGA drops to 40% of the maximum lifetime of the random algorithm when $d = 5$. However, it is still almost 3 times higher than that of the greedy algorithm.

Figure 7: Durability of different GC algorithms (normalized with respect to the greedy algorithm).
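The durability measurement described in this subsection can be mimicked on a stream of erase events. The sketch below counts erasures rather than simulated time, which is a simplifying assumption, and the function name `durability` and its parameters are hypothetical. It stops as soon as the allowed percentage of blocks has reached the erasure limit.

```python
from itertools import cycle, islice

def durability(erase_events, n_blocks, erase_limit=50, bad_percent=5):
    """Number of erase events processed when bad_percent% of the blocks have
    reached erase_limit; None if the stream ends before that happens."""
    counts = [0] * n_blocks
    worn_out = 0
    for n_events, block in enumerate(erase_events, start=1):
        counts[block] += 1
        if counts[block] == erase_limit:
            worn_out += 1
            # Integer comparison avoids floating-point rounding of the threshold.
            if worn_out * 100 >= bad_percent * n_blocks:
                return n_events
    return None

# Perfectly levelled erasures over 100 blocks versus erasures confined to 10 blocks.
levelled = durability(islice(cycle(range(100)), 10 ** 6), 100)
confined = durability(islice(cycle(range(10)), 10 ** 6), 100)
```

In this toy setting the levelled stream survives roughly ten times as many erasures as the confined one, which mirrors the direction (though not the magnitude) of the gap between the random and greedy algorithms reported above.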



6.7 Summary 

From the above simulations, we see that the greedy algorithm 
performs the best and the random algorithm performs the worst in 
terms of cleaning cost and I/O throughput, while the opposite holds 
in terms of wear-leveling and durability. We demonstrate that our 
RGA provides a tradeoff spectrum between the two algorithms by 
tuning the window size. This simulation study not only confirms 
our theoretical model, but also shows that our RGA can be viewed 
as an effective tunable algorithm to balance between throughput 
performance and durability of an SSD. 

7. RELATED WORK 

The research on NAND-flash based SSDs has recently received a lot of attention. Many aspects of SSDs are being studied. A survey on algorithms and data structures for flash memories can be found in [22]. Kawaguchi et al. [31] propose a flash-based file system based on the log-structured file system design. Birrell et al. [6] propose new data structures to improve the write performance of SSDs, and Gupta et al. [25] suggest exploiting value locality and designing content-addressable SSDs so as to optimize the performance. Matthews et al. [35] use NAND-based disk caching to mitigate the I/O bottlenecks of HDDs, and Kim et al. [32] consider hybrid storage by combining SSDs and HDDs. Agrawal et al. [1] study different design tradeoffs of SSDs via a trace-driven simulator based on DiskSim [8]. Chen et al. [13] further reveal many intrinsic characteristics of SSDs via empirical measurements. Polte et al. [41] also study the performance of SSDs via experiments, and Park et al. [40] mainly focus on the energy efficiency of SSDs. Note that [1] addresses the tradeoff between cleaning cost and wear-leveling in GC, but it is mainly based on empirical evaluation.

A variety of wear-leveling techniques have been proposed, mainly from an applied perspective. Some of them are proposed in patents [2, 3, 7, 20, 26, 34, 46]. Several research papers aim to maximize wear-leveling in SSDs based on hot-cold swapping, whose main idea is to swap the frequently-used hot data in worn blocks and the rarely-used cold data in new blocks. For example, Chiang et al. [14, 15] propose clustering methods for hot/cold data based on access patterns to maximize wear-leveling. Jung et al. [30] propose a memory-efficient design for wear-leveling by tracking only block groups, while maintaining wear-leveling performance. Authors of [10, 12] also propose different strategies based on hot-cold swapping to further improve the wear-leveling performance. Our work differs from the above studies in that we focus on characterizing the optimal tradeoff of GC algorithms, such that we provide flexibility for SSD practitioners to reduce wear-leveling to trade for higher cleaning performance. We also propose a tunable GC algorithm to realize the tradeoff.

From a theoretical perspective, some studies propose analytical frameworks to quantify the performance of GC algorithms. A comparative study between online and offline wear-leveling policies is presented in [4]. Hu et al. [28] propose a probabilistic model to quantify the additional writes due to GC (i.e., the cleaning cost defined in our work). They study a modified greedy GC algorithm, and implement an event-driven simulator to validate their model. Bux and Iliadis [9] propose theoretical models to analyze the greedy GC algorithm under the uniform workload, and Desnoyers [18] also analyzes the performance of LRU and greedy GC algorithms when page-level address mapping is used. Our work differs from them in the following. First, the previous work focuses on analyzing the write amplification which corresponds to the cleaning cost in our paper, but our focus is to analyze the tradeoff between cleaning cost and wear-leveling, which are both very important in designing GC algorithms, and further explore the design space of GC algorithms. Second, our analytical models are also very different. In particular, we use a Markov model to characterize the I/O dynamics of SSDs and adapt the mean field technique to approximate large-scale systems, then we develop an optimization framework to derive the optimal tradeoff curve. Finally, our model also provides a good approximation under general workload and address mapping, and it is further validated via trace-driven evaluation.

We note that an independent analytical work [44], which is published in the same conference as ours, also applies the mean-field technique to analyze different GC algorithms. Its d-choices GC algorithm has the same construction as our RGA. Our work has the following key differences. First, similar to prior analytical studies, the work [44] focuses on write amplification, while we focus on the tradeoff between cleaning cost and wear-leveling. Second, its analysis is limited to the uniform workload only, while we also address the general workload. Finally, we validate our analysis via trace-driven simulations, which are not considered in the work [44].

8. CONCLUSIONS 

In this paper, we propose an analytical model to characterize the 
performance-durability tradeoff of an SSD. We model the I/O dy- 
namics of a large-scale SSD, and use the mean field theory to de- 
rive the asymptotic results in equilibrium. In particular, we classify 
the blocks of an SSD into different types according to the number 
of valid pages contained in each block, and our mean field results 
can provide effective approximation on the fraction of different 
types of blocks in the steady state even under the general workload. 
We define two metrics, namely cleaning cost and wear-leveling, to 
quantify the performance of GC algorithms. In particular, we the- 
oretically characterize the optimal tradeoff curve between cleaning 
cost and wear-leveling, and develop an optimization framework to 
explore the full optimal design space of GC algorithms. Inspired 
from our analytical framework, we develop a tunable GC algo- 
rithm called the randomized greedy algorithm (RGA) which can 
efficiently balance the tradeoff between cleaning cost and wear- 
leveling by tuning the parameter of the window size d. We use 
trace-driven simulation based on DiskSim with SSD add-ons to validate our analytical model, and show the effectiveness of RGA in tuning the performance-durability tradeoff in deployment.

APPENDIX

A. PROOF OF CONVERGENCE 

We now formally prove the convergence of the SSD system state under the uniform workload. Our proof consists of two parts. We first prove that the stochastic process $M^N(t)$ indeed converges to the deterministic process $s(t)$ when $N \to \infty$. We then prove that the deterministic process described in Equation (5) converges to the unique fixed point $\pi$ in Equation (6).

We first show that the re-scaled process $\bar{M}^N(t)$ converges to $s(t)$. Let us first show several important properties of the stochastic process $M^N(t)$.

Lemma 1. Define $S = \{m \in \mathbb{R}^{k+1} \mid \sum_{i=0}^{k} m_i = 1,\ m_i \ge 0\ \forall i\}$. For any $s \in S$, let $f^N(s) = E(M^N(t+1) - M^N(t) \mid M^N(t) = s)$ be the expected change to the occupancy measure in one time slot, and let $\epsilon(N) = 1/N$. We have $\lim_{N \to \infty} \epsilon(N) = 0$, and

$$\lim_{N \to \infty} \frac{f^N(s)}{\epsilon(N)}$$

exists for $\forall s \in S$.



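Proof: Since $\epsilon(N) = \frac{1}{N}$, $\lim_{N \to \infty} \epsilon(N) = 0$. Denote $f^N(s) = (f_0^N(s), f_1^N(s), \ldots, f_k^N(s))$. Consider the expected change in $M_i(t)$ ($1 \le i \le k-1$) during one time slot when $\lambda$ requests arrive. For each request, the probability of changing a block of type $i$ to other states is $\frac{1}{N} N s_i$, and the corresponding change in $M_i(t)$ is $-\frac{1}{N}$. A request may also change blocks of type $i+1$ (or type $i-1$) to state $i$, with probability being $\frac{i+1}{Nk} N s_{i+1}$ (or $\frac{k-i+1}{Nk} N s_{i-1}$) and the change in $M_i(t)$ being $\frac{1}{N}$ (or $\frac{1}{N}$). Thus, we have

$$f_i^N(s) = -\frac{1}{N} \lambda s_i + \frac{i+1}{Nk} \lambda s_{i+1} + \frac{k-i+1}{Nk} \lambda s_{i-1}, \quad 1 \le i \le k-1.$$

Similarly, we can also derive $f_0^N(s)$ and $f_k^N(s)$. Therefore, we have $\lim_{N \to \infty} \frac{f^N(s)}{\epsilon(N)} = f(s)$, where

$$f_0(s) = -\lambda s_0 + \frac{1}{k} \lambda s_1,$$

$$f_i(s) = -\lambda s_i + \frac{i+1}{k} \lambda s_{i+1} + \frac{k-i+1}{k} \lambda s_{i-1}, \quad 1 \le i \le k-1,$$

$$f_k(s) = -\lambda s_k + \frac{1}{k} \lambda s_{k-1}. \qquad \blacksquare$$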



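Lemma 2. Define $W^N(t)$ as an upper bound on the number of blocks that make a transition in time slot $t$. Then $W^N(t)$ satisfies $E(W^N(t)^2 \mid M^N(t) = s) \le c N^2 \epsilon(N)^2$ where $c$ is a constant.

Proof: During the time slot $t$, $\lambda$ requests arrive, and for each request, it accesses a block with probability $\frac{1}{N}$. Therefore, $W^N(t)$ follows a binomial distribution with parameters $\frac{\lambda}{N}$ and $N$. Since $N^2 \epsilon(N)^2 = 1$, we have

$$E(W^N(t)^2 \mid M^N(t) = s) = N \frac{\lambda}{N} \left(1 - \frac{\lambda}{N}\right) + \left(N \frac{\lambda}{N}\right)^2 \le \lambda^2 + \lambda,$$

which shows the result in Lemma 2 with $c = \lambda^2 + \lambda$. ∎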

Lemma 3. There exists $\beta > 0$ and a function $\varphi(s, \alpha)$ defined on $S \times [0, \beta]$ such that $\varphi$ has continuous derivatives everywhere and

$$\frac{f^N(s)}{\epsilon(N)} = \varphi\left(s, \frac{1}{N}\right).$$

Proof: From the proof of Lemma 1, $\frac{f^N(s)}{\epsilon(N)} = f(s)$. Since $f(s)$ is a rational function with respect to $s$ and $\frac{1}{N}$, Lemma 3 holds. ∎

Now, we can show that the re-scaled process $\bar{M}^N(t)$ converges to $s(t)$, with the following theorem.

Theorem 4. If $M^N(0) \to m$ in probability as $N \to \infty$, then for all $T > 0$, $\sup_{0 \le t \le T} \|\bar{M}^N(t) - s(t)\| \to 0$ in probability, where $s(t)$ satisfies the ODEs in Equation (5) and $s(0) = m$.

Proof: The theorem holds due to Lemmas 1, 2, and 3 and the existing theorem in [5] (Corollary 1). ∎

Note that the theorem in [5] provides a way to prove the convergence to the mean field limit, provided that several sufficient conditions hold. Therefore, to invoke the theorem in [5], we must explicitly verify that our model indeed satisfies the conditions, i.e., Lemmas 1-3, so as to make the proof complete.

Corollary 1. If $M^N(0) \to m$ in probability as $N \to \infty$, then for all $T > 0$, $\sup_{0 \le t \le T} \|M^N(t) - s(t)\| \to 0$ in probability, where $s(t)$ satisfies the ODEs in Equation (5) and $s(0) = m$.

In the following, we prove that the deterministic process $s(t)$ in Equation (5) converges to the unique fixed point $\pi$ in Equation (6). Note that the ODEs in Equation (5) are just a special case of the ODEs in Equation (8). Since our proof also applies to the general case, to avoid redundancy, we directly present the convergence proof of the general case, i.e., the convergence of the ODEs in Equation (8) to the fixed point in Equation (9). The detailed proof is shown in Theorem 5. We thank Professor Benny Van Houdt for giving us invaluable comments on this proof.

Theorem 5. The deterministic process $s(t)$ which is specified by ODEs (8) converges to the fixed point $\pi$ which is determined by Equation (9).



Proof: Note that Equation (8) can be rewritten as follows.

$$\frac{ds(t)}{dt} = s(t) Q, \qquad (21)$$

where $s(t) = (s_0(t), s_1(t), \ldots, s_k(t))$ and $Q = [q_{ij}]$ with

$$q_{ij} = \begin{cases} \lambda p_{i,j}, & \text{for } j \neq i, \\ -\lambda p_{0,1}, & \text{for } j = i = 0, \\ -\lambda p_{k,k-1}, & \text{for } j = i = k, \\ -\lambda (p_{i,i-1} + p_{i,i+1}), & \text{for } 0 < j = i < k. \end{cases}$$

Note that if we treat the state transition of a particular block shown in Figure 1 as a birth-death process, then Equation (21) exactly maps to Kolmogorov's forward equations where $Q$ is just the rate matrix of the birth-death process. Therefore, $s(t)$ converges to the stationary distribution of the birth-death process $\pi$ where $\pi Q = 0$. We can easily verify that the fixed point $\pi$ in Equation (9) satisfies the condition $\pi Q = 0$, which completes the proof. ∎

Note that Theorem 5 also completes the proof that the ODEs in Equation (5) converge to the unique fixed point $\pi$ in Equation (6).
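The convergence stated in Theorem 5 can be checked numerically for the uniform workload. The sketch below integrates the drift $f(s)$ from the proof of Lemma 1 with a forward Euler scheme and compares the result with the stationary distribution of the corresponding birth-death process, which for these rates is binomial with parameters $k$ and 1/2. The step size, time horizon, initial state, and function names are illustrative assumptions.

```python
from math import comb

def drift(s, lam, k):
    """Right-hand side f(s) of the ODEs for the uniform workload."""
    f = [0.0] * (k + 1)
    for i in range(k + 1):
        f[i] = -lam * s[i]
        if i >= 1:
            f[i] += lam * (k - i + 1) / k * s[i - 1]   # blocks moving i-1 -> i
        if i <= k - 1:
            f[i] += lam * (i + 1) / k * s[i + 1]       # blocks moving i+1 -> i
    return f

def integrate(s0, lam, k, dt=0.01, t_end=80.0):
    """Forward Euler integration of ds/dt = f(s) starting from s0."""
    s = list(s0)
    for _ in range(int(t_end / dt)):
        f = drift(s, lam, k)
        s = [si + dt * fi for si, fi in zip(s, f)]
    return s

k, lam = 8, 1.0
s_final = integrate([1.0] + [0.0] * k, lam, k)   # all blocks start with no valid pages
binomial = [comb(k, i) / 2 ** k for i in range(k + 1)]
gap = max(abs(a - b) for a, b in zip(s_final, binomial))
```

The slowest mode of this system decays at rate $2\lambda/k$, so a horizon of 80 time units with $k = 8$ is long enough for `gap` to become negligible.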

B. PROOF OF OPTIMAL DESIGN SPACE 

We now prove Theorem 3 in §3.4. We solve Equation (16) by 
minimizing the inverse of the objective function, which is a 
convex optimization problem. If a point $(w, u, v_1, v_2)$ satisfies 
the KKT conditions, which are stated in Equation (22), then $w$ is 
the global minimum. 

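\[
\begin{aligned}
& 2 w_i \pi_i - u_i + v_1 \pi_i + v_2\, i \pi_i = 0; \quad u_i \ge 0; \quad w_i \ge 0; \\
& u_i w_i = 0; \quad \sum_{i=0}^{k} w_i \pi_i = 1; \quad \sum_{i=0}^{k} i\, w_i \pi_i = C^*.
\end{aligned}
\tag{22}
\]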
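
As a sketch of where the first condition comes from, assume the inverse objective being minimized is $\sum_i w_i^2 \pi_i$. This is an assumption, suggested by the $2 w_i \pi_i$ term; the objective itself is given by Equation (16) in the main text. The symbol $\Phi$ for the Lagrangian is introduced here only for illustration. With multipliers $u_i$ for the constraints $w_i \ge 0$ and $v_1, v_2$ for the two equality constraints, the stationarity condition follows by differentiating with respect to $w_i$:

```latex
\begin{align*}
\Phi(w, u, v_1, v_2)
  &= \sum_{i=0}^{k} w_i^2 \pi_i
   - \sum_{i=0}^{k} u_i w_i
   + v_1 \Big( \sum_{i=0}^{k} w_i \pi_i - 1 \Big)
   + v_2 \Big( \sum_{i=0}^{k} i\, w_i \pi_i - C^* \Big), \\
\frac{\partial \Phi}{\partial w_i}
  &= 2 w_i \pi_i - u_i + v_1 \pi_i + v_2\, i \pi_i = 0 .
\end{align*}
```

The remaining conditions are primal feasibility ($w_i \ge 0$ and the two equalities), dual feasibility ($u_i \ge 0$), and complementary slackness ($u_i w_i = 0$).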

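To find a point satisfying the KKT conditions, we first consider 
the case when $0 < C^* < \sum_{i=0}^{k} i \pi_i$. Let 

\[ \mathcal{I}_1 = \min \Big\{ j : \sum_{i=0}^{j} i \pi_i - C^* \sum_{i=0}^{j} \pi_i > 0 \Big\}. \tag{23} \]

Note that $\mathcal{I}_1$ must exist because $C^* < \sum_{i=0}^{k} i \pi_i$ and 
$\sum_{i=0}^{k} \pi_i = 1$. Clearly, we have $\mathcal{I}_1 > C^*$ and 
$\sum_{i=0}^{j} i \pi_i - C^* \sum_{i=0}^{j} \pi_i > 0$ for $\mathcal{I}_1 \le j \le k$. 
Moreover, we have $\sum_{i=0}^{j} i^2 \pi_i - C^* \sum_{i=0}^{j} i \pi_i > 0$ 
($\mathcal{I}_1 \le j \le k$). Now we prove that the following inequality holds. 

\[
\frac{\sum_{i=0}^{\mathcal{I}_1} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}_1} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} \pi_i}
\ge \mathcal{I}_1. \tag{24}
\]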

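To prove inequality (24), we rewrite the left-hand side of the 
inequality as follows. 

\[
\frac{\sum_{i=0}^{\mathcal{I}_1} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}_1} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} \pi_i}
= \frac{-ax + by + \mathcal{I}_1 z}{-x + y + z},
\]

where $x = \sum_{0 \le i < C^*} (C^* - i) \pi_i$, 
$y = \sum_{C^* \le i \le \mathcal{I}_1 - 1} (i - C^*) \pi_i$, and 
$z = (\mathcal{I}_1 - C^*) \pi_{\mathcal{I}_1}$. Here $a$ and $b$ are the weighted 
averages of $i$ over the index ranges of $x$ and $y$, respectively, i.e., 
$ax = \sum_{0 \le i < C^*} i (C^* - i) \pi_i$ and 
$by = \sum_{C^* \le i \le \mathcal{I}_1 - 1} i (i - C^*) \pi_i$. 
Clearly, we have $x \ge 0$, $y \ge 0$, $z \ge 0$ and 
$\mathcal{I}_1 > b \ge a \ge 0$. Since $\mathcal{I}_1$ is the smallest integer which satisfies 
the condition in (23), we also have $-x + y \le 0$ and $-x + y + z > 0$. 
Now, if $-ax + by \ge 0$, then inequality (24) holds. Otherwise, 

\[
\frac{-ax + by + \mathcal{I}_1 z}{-x + y + z}
= \mathcal{I}_1 + \frac{(\mathcal{I}_1 - a)(x - y) + (b - a) y}{-x + y + z}
\ge \mathcal{I}_1
\]

(as $\mathcal{I}_1 > b \ge a \ge 0$, $-x + y \le 0$, and $-x + y + z > 0$). 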

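Now, we argue that there exists an $\mathcal{I}$ ($\mathcal{I}_1 \le \mathcal{I} \le k$) such that 

\[
\begin{aligned}
& \mathcal{I} < k \ \text{ and } \ \mathcal{I} \le
\frac{\sum_{i=0}^{\mathcal{I}} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}} \pi_i}
< \mathcal{I} + 1, \ \text{ or} \\
& \mathcal{I} = k \ \text{ and } \ \mathcal{I} \le
\frac{\sum_{i=0}^{\mathcal{I}} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}} \pi_i}.
\end{aligned}
\tag{25}
\]

To prove it, we can examine the candidates starting from $\mathcal{I}_1$. Since 
inequality (24) holds, if $\mathcal{I}_1 < k$ and 

\[
\frac{\sum_{i=0}^{\mathcal{I}_1} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}_1} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1} \pi_i}
\ge \mathcal{I}_1 + 1,
\]

then we have 

\[
\frac{\sum_{i=0}^{\mathcal{I}_1 + 1} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1 + 1} i \pi_i}
     {\sum_{i=0}^{\mathcal{I}_1 + 1} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}_1 + 1} \pi_i}
\ge \mathcal{I}_1 + 1.
\]

Therefore, either we find an $\mathcal{I}$ such that the ratio in (25) lies in 
$[\mathcal{I}, \mathcal{I} + 1)$, or we reach $k$. Now, given the $\mathcal{I}$ in Equation (25), we define 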



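\[
\begin{aligned}
X_{\mathcal{I}} &= \sum_{i=0}^{\mathcal{I}} i^2 \pi_i - C^* \sum_{i=0}^{\mathcal{I}} i \pi_i, \qquad
Y_{\mathcal{I}} = \sum_{i=0}^{\mathcal{I}} i \pi_i - C^* \sum_{i=0}^{\mathcal{I}} \pi_i, \\
Z_{\mathcal{I}} &= \Big( \sum_{i=0}^{\mathcal{I}} \pi_i \Big) \Big( \sum_{i=0}^{\mathcal{I}} i^2 \pi_i \Big)
 - \Big( \sum_{i=0}^{\mathcal{I}} i \pi_i \Big)^2 .
\end{aligned}
\]

By Cauchy's inequality, we have $Z_{\mathcal{I}} > 0$. If we define 

\[ \gamma_i = X_{\mathcal{I}} / Z_{\mathcal{I}} - i \times Y_{\mathcal{I}} / Z_{\mathcal{I}}, \tag{26} \]

we have $\gamma_i \ge 0$ for $0 \le i \le \mathcal{I}$, and $\gamma_i < 0$ for $\mathcal{I} + 1 \le i \le k$. 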
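
The form of these quantities can be recovered by a short calculation, given here as a reconstruction rather than a quotation from the proof. Suppose the weights are linear in $i$ on the active range, $w_i = \alpha - \beta i$ for $0 \le i \le \mathcal{I}$, and zero elsewhere ($\alpha$ and $\beta$ are symbols introduced here only for illustration). The two equality constraints of the KKT conditions then form a $2 \times 2$ linear system, with all sums taken over $0 \le i \le \mathcal{I}$:

```latex
\begin{align*}
\alpha \sum \pi_i - \beta \sum i \pi_i &= 1, &
\alpha \sum i \pi_i - \beta \sum i^2 \pi_i &= C^*, \\[4pt]
\alpha &= \frac{\sum i^2 \pi_i - C^* \sum i \pi_i}
               {\big(\sum \pi_i\big)\big(\sum i^2 \pi_i\big) - \big(\sum i \pi_i\big)^2}, &
\beta &= \frac{\sum i \pi_i - C^* \sum \pi_i}
              {\big(\sum \pi_i\big)\big(\sum i^2 \pi_i\big) - \big(\sum i \pi_i\big)^2}.
\end{align*}
```

The numerators are the two partial-sum expressions that appear in the ratio of (25), and the common denominator is nonnegative by Cauchy's inequality, vanishing only when at most one $\pi_i$ in the range is nonzero.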

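We can verify that $(w, u, v_1, v_2)$, which is defined as follows, 
satisfies the KKT conditions (22). Thus, $w$ is the global minimum. 

\[
\begin{aligned}
& v_1 = -2 X_{\mathcal{I}} / Z_{\mathcal{I}}, \qquad v_2 = 2 Y_{\mathcal{I}} / Z_{\mathcal{I}}, \\
& w_i = \gamma_i, \quad u_i = 0, \qquad 0 \le i \le \mathcal{I}, \\
& w_i = 0, \quad u_i = -2 \gamma_i \pi_i, \qquad \mathcal{I} + 1 \le i \le k.
\end{aligned}
\]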
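
The construction for this first case can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the example distribution are hypothetical, all $\pi_i$ are assumed positive, and the denominator is taken to be $(\sum \pi_i)(\sum i^2 \pi_i) - (\sum i \pi_i)^2$ over $0 \le i \le \mathcal{I}$, the form that makes both equality constraints hold. The scan starts from the smallest index and stops at the first one satisfying (25).

```python
def solve_low_cost_case(pi, c_star):
    """Sketch of the case 0 < C* < sum_i i*pi_i.

    Scans I upward while keeping running sums over i = 0..I, and stops at
    the first I that satisfies the condition of Equation (25).
    Returns (I, w, u, v1, v2). Assumes every pi[i] > 0.
    """
    k = len(pi) - 1
    A = B = D = 0.0  # running sums of pi_i, i*pi_i, i^2*pi_i over i = 0..I
    for I in range(k + 1):
        A += pi[I]
        B += I * pi[I]
        D += I * I * pi[I]
        X = D - c_star * B
        Y = B - c_star * A
        if Y <= 0:
            continue  # I is still below I_1 of Equation (23)
        ratio = X / Y
        if ratio >= I and (I == k or ratio < I + 1):
            Z = A * D - B * B  # positive by Cauchy's inequality
            gamma = [X / Z - i * Y / Z for i in range(k + 1)]
            w = [gamma[i] if i <= I else 0.0 for i in range(k + 1)]
            u = [0.0 if i <= I else -2.0 * gamma[i] * pi[i]
                 for i in range(k + 1)]
            return I, w, u, -2.0 * X / Z, 2.0 * Y / Z
    raise ValueError("no I found; is 0 < C* < sum_i i*pi_i ?")


# Hypothetical stationary distribution with k = 4 and mean 2.0.
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
I, w, u, v1, v2 = solve_low_cost_case(pi, c_star=0.4)

# Residual of the stationarity condition in (22) for each i.
stationarity = [2 * w[i] * pi[i] - u[i] + v1 * pi[i] + v2 * i * pi[i]
                for i in range(len(pi))]
```

For this example the scan stops at $\mathcal{I} = 1$, so only the two states with the fewest valid pages receive positive weight, and the stationarity residuals are zero up to rounding.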



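Similarly, we can find the optimal solution for the case when 
$C^* > \sum_{i=0}^{k} i \pi_i$. Since the framework of the proof is very similar, 
we only present the solution. In particular, we define 

\[
\begin{aligned}
X_{\mathcal{L}} &= \sum_{i=\mathcal{L}}^{k} i^2 \pi_i - C^* \sum_{i=\mathcal{L}}^{k} i \pi_i, \qquad
Y_{\mathcal{L}} = \sum_{i=\mathcal{L}}^{k} i \pi_i - C^* \sum_{i=\mathcal{L}}^{k} \pi_i, \\
Z_{\mathcal{L}} &= \Big( \sum_{i=\mathcal{L}}^{k} \pi_i \Big) \Big( \sum_{i=\mathcal{L}}^{k} i^2 \pi_i \Big)
 - \Big( \sum_{i=\mathcal{L}}^{k} i \pi_i \Big)^2 ,
\end{aligned}
\]

where $\mathcal{L}$ is an integer which satisfies the following condition. 

\[
\mathcal{L} > 0 \ \text{ and } \ \mathcal{L} \ge X_{\mathcal{L}} / Y_{\mathcal{L}} > \mathcal{L} - 1, \ \text{ or } \
\mathcal{L} = 0 \ \text{ and } \ 0 \ge X_{\mathcal{L}} / Y_{\mathcal{L}}. \tag{27}
\]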



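If we define 

\[ \Gamma_i = X_{\mathcal{L}} / Z_{\mathcal{L}} - i \times Y_{\mathcal{L}} / Z_{\mathcal{L}}, \tag{28} \]

we can also verify that $(w, u, v_1, v_2)$, which is defined as follows, 
satisfies the KKT conditions. Therefore, $w$ is the global minimum. 

\[
\begin{aligned}
& v_1 = -2 X_{\mathcal{L}} / Z_{\mathcal{L}}, \qquad v_2 = 2 Y_{\mathcal{L}} / Z_{\mathcal{L}}, \\
& w_i = 0, \quad u_i = -2 \Gamma_i \pi_i, \qquad 0 \le i \le \mathcal{L} - 1, \\
& w_i = \Gamma_i, \quad u_i = 0, \qquad \mathcal{L} \le i \le k.
\end{aligned}
\]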
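
The second case can be sketched the same way. The proof only states the solution, so the downward scan used below to locate $\mathcal{L}$ is an assumption made by mirroring the first case; the function name and example distribution are hypothetical, all $\pi_i$ are assumed positive, and the sums defining the three quantities are taken over $\mathcal{L} \le i \le k$.

```python
def solve_high_cost_case(pi, c_star):
    """Sketch of the case C* > sum_i i*pi_i (with C* < k).

    Scans L downward from k while keeping running sums over i = L..k, and
    stops at the first L that satisfies the condition of Equation (27).
    Returns (L, w, u, v1, v2). Assumes every pi[i] > 0.
    """
    k = len(pi) - 1
    A = B = D = 0.0  # running sums of pi_i, i*pi_i, i^2*pi_i over i = L..k
    for L in range(k, -1, -1):
        A += pi[L]
        B += L * pi[L]
        D += L * L * pi[L]
        X = D - c_star * B
        Y = B - c_star * A
        Z = A * D - B * B
        if Y >= 0 or Z <= 0:
            continue  # weights would not be well defined or not feasible
        ratio = X / Y
        if (L > 0 and L - 1 < ratio <= L) or (L == 0 and ratio <= 0):
            Gamma = [X / Z - i * Y / Z for i in range(k + 1)]
            w = [Gamma[i] if i >= L else 0.0 for i in range(k + 1)]
            u = [0.0 if i >= L else -2.0 * Gamma[i] * pi[i]
                 for i in range(k + 1)]
            return L, w, u, -2.0 * X / Z, 2.0 * Y / Z
    raise ValueError("no L found; is sum_i i*pi_i < C* < k ?")


# Hypothetical stationary distribution with k = 4 and mean 2.0.
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
L, w, u, v1, v2 = solve_high_cost_case(pi, c_star=3.6)

# Residual of the stationarity condition in (22) for each i.
stationarity = [2 * w[i] * pi[i] - u[i] + v1 * pi[i] + v2 * i * pi[i]
                for i in range(len(pi))]
```

For this example the scan stops at $\mathcal{L} = 3$, so only the two states with the most valid pages receive positive weight, which mirrors the first case.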



The cases when $C^* = 0$ or $k$ and $C^* = \sum_{i=0}^{k} i \pi_i$ correspond 
to the greedy and random algorithms, respectively. Therefore, the 
maximum wear-leveling $W^*$ can be derived as in Equation (16), 
where $\gamma_i$, $\mathcal{I}$, $\Gamma_i$, and $\mathcal{L}$ are defined by Equations (25)-(28). 



